Understanding the Evolution of ChatGPT: Part 2 — GPT-2 and GPT-3 | by Shirley Li | January 2025

A Paradigm Shift Beyond Fine-tuning
In our previous article, we revisited the core concepts behind GPT-1 and what inspired it. By combining generative pre-training with a Transformer-decoder-only architecture, GPT-1 revolutionized the field of NLP and made "pre-training and fine-tuning" the standard paradigm.
But OpenAI didn't stop there.
Instead, while trying to understand why pre-training Transformers as language models works so well, they began to notice the zero-shot behavior of GPT-1: as pre-training continued, the model gradually improved its performance on tasks it had never been fine-tuned on, indicating that pre-training can improve its zero-shot capability, as shown in the figure below:
This prompted a paradigm shift from "pre-training and fine-tuning" to "pre-training only", or in other words, a task-agnostic pre-trained model that can handle different tasks without fine-tuning.
Both GPT-2 and GPT-3 are designed following this philosophy.
But why, you might ask, isn't the pre-training and fine-tuning recipe good enough? What are the additional benefits of bypassing the fine-tuning stage?
Limitations of Fine-tuning
Fine-tuning works well for some well-defined tasks, but there are many more tasks in the NLP domain that we have not had a chance to explore yet.
For those tasks, requiring a fine-tuning stage means we would need to collect a reasonably sized labeled dataset for each new task, which is clearly not ideal if we want our models to be truly intelligent one day.
Meanwhile, in other works, researchers have observed that the risk of exploiting spurious correlations in the fine-tuning data grows as models become larger and larger. This creates a paradox: the model needs to be large enough to absorb as much information as possible during pre-training, yet fine-tuning such a large model on a small, narrowly distributed dataset makes it struggle to generalize to out-of-distribution samples.
Another reason is that, as humans, we do not need large supervised datasets to learn most language tasks, and if we want our models to be truly useful one day, we would like them to have the same flexibility and generality.
Now perhaps the real question is: what can we do to achieve that goal and move beyond fine-tuning?
Before getting into the details of GPT-2 and GPT-3, let's first look at three key elements that influenced their model design: task-agnostic learning, the scale hypothesis, and in-context learning.
Task-agnostic Learning
Task-agnostic learning, also known as meta-learning or learning to learn, refers to a paradigm in machine learning where a model develops a broad set of skills at training time, and then uses these skills at inference time to quickly adapt to a new task.
For example, in MAML (Model-Agnostic Meta-Learning), the authors showed that models can adapt to new tasks with very few examples. Specifically, during each inner loop (highlighted in blue), the model first samples a task from the set of tasks and performs several steps of gradient descent, resulting in an adapted model. This adapted model is then evaluated on the same task in the outer loop (highlighted in orange), and the resulting loss is used to update the original model parameters.
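To make the inner/outer loop idea more concrete, below is a minimal sketch of MAML in PyTorch. It uses a toy sine-regression task distribution rather than the benchmarks from the original paper, and names such as sample_task, the tiny MLP, and all hyperparameters are illustrative choices, not the authors' setup.

```python
import torch

def forward(params, x):
    # Tiny two-layer MLP applied functionally, so we can differentiate
    # through the inner-loop update when computing the meta-gradient.
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def sample_task(n_support=10, n_query=10):
    # One "task" = regressing a sine curve with a random amplitude and phase.
    amp = torch.rand(1) * 4.0 + 1.0
    phase = torch.rand(1) * 3.1416
    def draw(n):
        x = torch.rand(n, 1) * 10.0 - 5.0
        return x, amp * torch.sin(x + phase)
    return draw(n_support), draw(n_query)

params = [(torch.randn(1, 40) * 0.1).requires_grad_(),
          torch.zeros(40, requires_grad=True),
          (torch.randn(40, 1) * 0.1).requires_grad_(),
          torch.zeros(1, requires_grad=True)]
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr, inner_steps = 0.01, 5

for iteration in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):  # meta-batch of sampled tasks
        (x_tr, y_tr), (x_te, y_te) = sample_task()
        # Inner loop: adapt a copy of the parameters to this task.
        fast = params
        for _ in range(inner_steps):
            inner_loss = ((forward(fast, x_tr) - y_tr) ** 2).mean()
            grads = torch.autograd.grad(inner_loss, fast, create_graph=True)
            fast = [p - inner_lr * g for p, g in zip(fast, grads)]
        # Outer loop: evaluate the adapted model on held-out data from the
        # same task, and backpropagate to the original parameters.
        outer_loss = ((forward(fast, x_te) - y_te) ** 2).mean()
        outer_loss.backward()
    meta_opt.step()
```

The key detail is create_graph=True in the inner loop, which keeps the adaptation steps differentiable so that the outer-loop loss can update the shared initialization.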
MAML shows that learning can be made more general and flexible, pointing to a way of moving beyond fine-tuning on individual tasks. In the figure below, the authors of GPT-3 explained how this idea can be extended to learning in language models when combined with in-context learning: the outer loop iterates over different tasks, while the inner loop is realized through in-context learning, which will be explained in more detail in the following sections.
The Scale Hypothesis
As the most influential idea behind the development of GPT-2 and GPT-3, the scale hypothesis refers to the observation that, when trained on large amounts of data, large models can develop new capabilities without explicit supervision; in other words, emergent abilities can arise as we scale up, like the zero-shot abilities we saw in the pre-trained GPT-1.
Both GPT-2 and GPT-3 can be seen as experiments to test this hypothesis: GPT-2 was set up to test whether a larger model pre-trained on a larger dataset can be directly used to solve downstream tasks, while GPT-3 was set up to test whether in-context learning can bring improvements over GPT-2 when scaled up further.
We will discuss in more detail how to implement this idea in the following sections.
In-Context Learning
As shown in Figure 3, in the context of language models, in-context learning refers to the inner loop of the meta-learning process, where the model is given a natural language instruction and a few demonstrations of the task at inference time, and is expected to complete that task by automatically discovering patterns in the given demonstrations.
Note that in-context learning happens at evaluation time with no gradient updates performed, which is completely different from conventional fine-tuning and much more similar to how humans tackle new tasks.
In case you are not familiar with the term, demonstrations usually refer to input-output example pairs for a particular task, as shown in the "examples" part of the figure below:
The concept of in-context learning was explored in GPT-2 and then formalized in GPT-3, where the authors defined three different settings: zero-shot, one-shot, and few-shot, depending on how many demonstrations are given to the model.
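As a concrete illustration, the sketch below assembles zero-, one-, and few-shot prompts from demonstrations, loosely following the English-to-French translation example illustrated in the GPT-3 paper. The build_prompt helper and the "=>" prompt layout are illustrative assumptions, not an official format.

```python
def build_prompt(task_description, demonstrations, query, n_shots):
    """Concatenate a task description, n_shots demonstrations, and the query."""
    lines = [task_description]                 # natural language instruction
    for src, tgt in demonstrations[:n_shots]:  # in-context demonstrations
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")                # the model completes this line
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

print(build_prompt("Translate English to French:", demos, "cheese", n_shots=0))  # zero-shot
print(build_prompt("Translate English to French:", demos, "cheese", n_shots=1))  # one-shot
print(build_prompt("Translate English to French:", demos, "cheese", n_shots=2))  # few-shot
```

The only difference between the three settings is how many demonstrations are packed into the prompt; the model's weights stay frozen in all cases.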
In short, task-agnostic learning highlights the potential of moving beyond fine-tuning, while the scale hypothesis and in-context learning suggest a practical path to achieving that.
In the following sections, we will go through more details of GPT-2 and GPT-3, respectively.