June 18, 2023
4 min read
LLMs (Large Language Models) are all the rage currently, thanks to ChatGPT. In this post we draw your attention to three foundational papers, published between 2017 and 2023, that we think every Data Scientist must read.
Ever since its November 2022 release, ChatGPT has taken the world by storm. Given the fast rate of change in this space, it can be hard to keep up with the pace of development, especially for those who are new to the field or out of touch with recent progress.
While many contributions have gone into making ChatGPT a reality, three papers published in the last six years have had an outsized impact in making LLMs exciting for the world.
So what can you expect to learn from reading these papers? We thought it would be interesting to give a high-level overview of each paper with the help of a word cloud.
The "Transformer" paper, officially published as "Attention Is All You Need", made the following important contributions:
- attention, i.e. the ability of a model to pay attention to the most important parts of an input. Moreover, the transformer model does not require sequences of data to be processed in any fixed order, which makes it parallelizable and hence efficient to train.
- an encoder and decoder architecture that utilizes positional encodings and multi-head attention.
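To make the attention idea concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The function name, tensor shapes, and toy input are our own illustrative choices, not code from the paper.

```python
# Minimal sketch of scaled dot-product attention (illustrative only,
# not the paper's reference implementation).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch, seq_len, d_model)
    d_k = query.size(-1)
    # How strongly each position should attend to every other position
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # attention weights sum to 1 per position
    return weights @ value               # each output is a weighted mix of the values

# Toy self-attention over a batch of 2 sequences, 5 tokens each, 16-dim embeddings
x = torch.randn(2, 5, 16)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([2, 5, 16])
```

Multi-head attention simply runs several such attention operations in parallel over learned projections of the input and concatenates the results.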
The "BERT" paper, officially published as "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", made the following important contributions:
Following up on the success of the transformer paper, Google next released the "BERT" model, which is based on the transformer architecture. The key difference compared to the original transformer is that a BERT model only utilizes "encoders" and does not use any "decoders".
BERT models have proven to be very effective at several NLP tasks, such as text classification, named entity recognition, and question answering.
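As a quick, hedged illustration of what an encoder-only model like BERT can do, the snippet below uses the Hugging Face transformers library to fill in a masked word. The example sentence is our own, and it assumes the library is installed and the bert-base-uncased weights can be downloaded.

```python
# Fill-in-the-blank with a pre-trained, encoder-only BERT model
# (requires the `transformers` package and a download of the
# "bert-base-uncased" weights).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Data scientists use [MASK] to train models."):
    print(prediction["token_str"], round(prediction["score"], 3))
```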
The "LLaMA" paper, officially published as "LLaMA: Open and Efficient Foundation Language Models", made the following important contributions:
In contrast to BERT, the LLaMA model only utilizes "decoders" and does not use any "encoders". This decoder-only design is also how ChatGPT works.
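To give a feel for decoder-only generation, here is a small sketch using the Hugging Face transformers library. Because the LLaMA weights are distributed under a separate license, we use GPT-2 as a freely available decoder-only stand-in; the prompt and generation settings are illustrative.

```python
# Decoder-only text generation. GPT-2 stands in here for a decoder-only
# model such as LLaMA, whose weights require a separate license.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # silences a padding warning for GPT-2
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```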
One interesting characteristic of these models is that they are all considered "foundational" and "general purpose", as they are trained on large amounts of unlabeled data. Many interesting applications come from fine-tuning these models for specific tasks using a machine learning technique known as "transfer learning". See the wildly popular "HuggingFace" community for examples of such fine-tuned models.
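As a rough sketch of what such fine-tuning looks like in practice, the snippet below adapts a pre-trained BERT encoder to sentiment classification on a small slice of the IMDB dataset. The dataset choice, hyperparameters, and output directory are illustrative, and the datasets and transformers packages are assumed to be installed.

```python
# Transfer learning sketch: fine-tune a pre-trained BERT encoder for
# binary sentiment classification (illustrative hyperparameters only).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

train_data = load_dataset("imdb", split="train[:1000]")  # small slice for speed
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="bert-imdb-demo", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=train_data).train()
```

The key idea is that only a small labeled dataset and a short training run are needed, because the pre-trained model already encodes general language knowledge.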