A team of researchers at UC Berkeley, University of Maryland, and UC Irvine conducted a study to identify the factors that can cause instability in the GPT-3 language model. The team proposes a contextual calibration procedure that consistently improves GPT-3's accuracy across diverse prompt formats and choices of training examples.
GPT-3 can perform numerous tasks when provided with a natural language prompt that contains a few training examples. Large language models such as GPT-3 have significantly improved few-shot performance. Few-shot learning allows users to prototype NLP models swiftly, enables non-technical users to create NLP systems, and allows a single model to be reused across tasks, reducing system memory use and complexity.
However, GPT-3's accuracy can be highly unstable across different prompts, i.e., across choices of training examples, their permutation, and the prompt format. This instability arises from language models' bias towards predicting certain answers, for instance, those that are frequent in the pretraining data.
Generally, neural autoregressive language models perform few-shot ("in-context") learning when provided with a natural language prompt. The prompt consists of a format, a set of training examples, and a permutation (ordering) of those examples.
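These three ingredients of a prompt can be made concrete with a small sketch. The template string and helper name below are hypothetical illustrations, not code from the study:

```python
def build_prompt(examples, order, test_input):
    """Assemble a few-shot prompt from a format, training examples,
    and a permutation of those examples (illustrative sketch)."""
    template = "Review: {text}\nSentiment: {label}\n\n"  # the "format"
    prompt = ""
    for i in order:  # the "permutation" of the training examples
        text, label = examples[i]
        prompt += template.format(text=text, label=label)
    # Append the test input, leaving the answer for the model to complete
    prompt += "Review: {}\nSentiment:".format(test_input)
    return prompt

examples = [("Great movie!", "Positive"), ("Dull and slow.", "Negative")]
print(build_prompt(examples, order=[1, 0], test_input="I loved it."))
```

Changing `template`, `examples`, or `order` yields a different prompt for the same task, which is exactly the variation the study examines.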
The first step was to study how GPT-3's accuracy changes across different prompts. The team ran sentiment analysis experiments with three GPT-3 model sizes on the SST-2 dataset. They observed high variance in GPT-3's accuracy across the choice of training examples, their permutation, and the prompt format. Remarkably, merely varying the permutation of the training examples could cause accuracy to range from 54.3 percent to 93.4 percent (nearly state-of-the-art).
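The permutation experiment can be sketched as a sweep over every ordering of the training examples. Here `accuracy_fn` is a hypothetical stand-in for querying GPT-3 and scoring it on a validation set:

```python
from itertools import permutations

def accuracy_over_permutations(examples, accuracy_fn):
    """Evaluate accuracy for every ordering of the training examples
    and report the spread (illustrative sketch, not the paper's code)."""
    scores = {}
    for order in permutations(range(len(examples))):
        scores[order] = accuracy_fn(order)  # stand-in for a GPT-3 eval
    return min(scores.values()), max(scores.values())
```

With a real model in place of `accuracy_fn`, the gap between the returned minimum and maximum is the permutation-induced variance the study reports (54.3 vs. 93.4 percent in their SST-2 runs).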
The researchers identified that the following three factors contribute to GPT-3 instability:
- Majority Label Bias: GPT-3 is biased towards answers that appear frequently in the prompt. The majority label bias helps explain why different choices of training examples massively influence GPT-3's accuracy.
- Common Token Bias: GPT-3 is biased towards outputting tokens that are common in its pretraining distribution. The common token bias helps explain the importance of label names and why the model struggles with rare answers.
- Recency Bias: The model tends to repeat answers that appear towards the end of the prompt. This recency bias aggravates the majority label bias and helps explain why the permutation of the training examples matters.
Together, these three biases generally amount to a simple shift in the model's output distribution.
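A toy example (not from the paper's code) shows how such a shift works: multiplying the true label probabilities by a fixed bias vector can flip the argmax prediction towards the favored label regardless of the input.

```python
import numpy as np

def apply_bias_shift(true_probs, bias):
    """Toy illustration: a fixed multiplicative shift on the label
    distribution skews predictions towards the biased label."""
    biased = np.asarray(true_probs, dtype=float) * np.asarray(bias, dtype=float)
    return biased / biased.sum()  # renormalize to a probability distribution

# A model mildly favoring label 0 flips an otherwise correct prediction:
print(apply_bias_shift([0.45, 0.55], bias=[0.7, 0.3]))
```

In this toy run the true distribution favors label 1, but the shifted distribution favors label 0, which is the failure mode calibration is meant to undo.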
The researchers found it compelling that a model's bias towards specific answers can be estimated by feeding it a content-free input. Inspired by this idea, they propose a data-free contextual calibration procedure that infers calibration parameters from the model's output on such an input. To evaluate contextual calibration's effectiveness, they ran tests on text classification, fact retrieval, and information extraction tasks across several datasets.
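The procedure can be sketched as follows, assuming the diagonal variant described in the paper: the model's label probabilities on a content-free input (e.g., "N/A") define a diagonal matrix W = diag(p_cf)^-1, which is then applied to each test prediction. The function below is a minimal sketch, not the authors' released code:

```python
import numpy as np

def contextual_calibration(p_cf, p_test):
    """Calibrate label probabilities using a content-free input.

    p_cf:   model's label probabilities for the content-free input (e.g. "N/A")
    p_test: model's label probabilities for a real test input
    """
    p_cf = np.asarray(p_cf, dtype=float)
    p_cf = p_cf / p_cf.sum()            # normalize over the label set
    W = np.diag(1.0 / p_cf)             # W = diag(p_cf)^-1, bias term b = 0
    q = W @ np.asarray(p_test, dtype=float)
    return q / q.sum()                  # renormalized calibrated probabilities

# Example: a model that skews positive on the content-free input
p_cf = [0.7, 0.3]     # content-free probabilities reveal the bias
p_test = [0.6, 0.4]   # raw prediction for a test sentence
print(contextual_calibration(p_cf, p_test))
```

Because the content-free input carries no sentiment, any deviation of `p_cf` from uniform is attributed to the prompt's biases; dividing it out flips the example above from "positive" to "negative".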
The proposed procedure substantially improves GPT-3's accuracy, raising the average and worst-case accuracy by up to about 30 percentage points (absolute), and reduces the variance across prompts. However, the researchers note that the model still learns some superficial patterns, such as repeating common answers. They therefore aim to more thoroughly understand and analyze the dynamics of in-context learning in future work.