This AI Paper from Huawei Introduces a Theoretical Framework Focused on the Memorization Process and Performance Dynamics of Transformer-based Language Models (LMs)

Transformer-based neural networks have shown a strong ability to handle tasks such as text generation, editing, and question answering. In many cases, models with more parameters perform better, as measured by lower perplexity and higher accuracy on end tasks. This is the main reason industry has pushed toward ever larger models. However, larger models do not always deliver better performance: for example, the 2B model MiniCPM exhibits capabilities comparable to larger language models such as Llama2-7B, Mistral-7B, Gemma-7B, and Llama-13B. Moreover, the amount of high-quality data available may not keep pace with the growing computational resources for training larger models.

Existing work relevant to these issues includes scaling laws, energy-based models, and Hopfield models. Scaling laws state that model performance improves as model size and the volume of training data are scaled up. Energy-based models have become a fundamental modeling tool across different areas of machine learning over the past few decades. Their main idea is to model a distribution with a parameterized probability density function expressed in terms of a learnable energy function. Finally, classical Hopfield networks were developed as an example of associative memory.
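To make the associative-memory idea concrete, here is a minimal sketch of a classical Hopfield network in Python with NumPy. It is an illustration only, not code from the paper: the function names (`train_hopfield`, `energy`, `recall`) and the two toy patterns are assumptions chosen for this example. It stores patterns with the Hebbian rule, then recovers a stored pattern from a corrupted copy while the energy decreases.

```python
import numpy as np

def train_hopfield(patterns):
    """Build Hebbian weights from rows of +1/-1 patterns (hypothetical helper)."""
    patterns = np.asarray(patterns, dtype=float)
    n_units = patterns.shape[1]
    weights = patterns.T @ patterns / n_units
    np.fill_diagonal(weights, 0.0)  # no self-connections
    return weights

def energy(weights, state):
    """Classical Hopfield energy: E(s) = -1/2 * s^T W s."""
    state = np.asarray(state, dtype=float)
    return -0.5 * state @ weights @ state

def recall(weights, state, max_sweeps=10):
    """Asynchronous sign updates until the state stops changing."""
    state = np.array(state, dtype=float)
    for _ in range(max_sweeps):
        previous = state.copy()
        for i in range(len(state)):
            state[i] = 1.0 if weights[i] @ state >= 0 else -1.0
        if np.array_equal(state, previous):
            break
    return state

# Two toy patterns, chosen to be orthogonal for this illustration.
stored = np.array([
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
])
weights = train_hopfield(stored)

noisy = stored[0].copy()
noisy[0] = -noisy[0]  # corrupt one unit
recovered = recall(weights, noisy)

print("recovered stored pattern:", np.array_equal(recovered, stored[0]))
print("energy before / after:", energy(weights, noisy), energy(weights, recovered))
```

The stored patterns sit at local minima of the energy, so retrieval amounts to descending the energy landscape from a nearby starting point.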

Researchers from the Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd. introduced a theoretical framework focused on the memorization process and performance dynamics of transformer-based language models (LMs). They carried out a series of experiments using GPT-2 across different data sizes to examine signs of saturation and, at the same time, trained vanilla Transformer models on a dataset of 2M tokens. The experimental results agree with the theoretical ones, offering insights into the optimal cross-entropy loss that can guide and improve decision-making in model training.

A 12-layer transformer LM is trained using the GPT-2 small tokenizer and architecture on the OpenWebText dataset. OpenWebText, which contains 9B tokens from 8,013,769 documents, is similar to the WebText dataset used to train the original GPT-2 model. Three models are trained on different amounts of data: the full dataset, a subset containing the first 1% (90M tokens), and a subset containing the first 0.1% (9M tokens). In addition, vanilla transformer models are trained on a small amount of high-quality data consisting of pairs of English sentences in declarative form. The sentences are context-free with a vocabulary size of 68 words, and the task is to convert declarative sentences into questions.
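The subset sizes follow directly from the 9B-token total. The short Python sketch below checks that arithmetic and shows one way a leading fraction of a token stream could be taken. The helper `first_fraction` and the toy corpus are hypothetical, and slicing by tokens rather than by documents is an assumption, since the article does not say how the subsets were cut.

```python
def first_fraction(tokens, fraction):
    """Return the leading `fraction` of a token sequence (hypothetical helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = round(len(tokens) * fraction)
    return tokens[:keep]

# Subset sizes implied by the 9B-token total quoted in the article.
total_tokens = 9_000_000_000
tokens_1pct = round(total_tokens * 0.01)
tokens_01pct = round(total_tokens * 0.001)
print(tokens_1pct, tokens_01pct)

# Toy stand-in for a tokenized corpus.
corpus = list(range(10_000))
subset_1pct = first_fraction(corpus, 0.01)
subset_01pct = first_fraction(corpus, 0.001)
print(len(subset_1pct), len(subset_01pct))
```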

The model trained on 0.1% (9M tokens) of the OpenWebText data over-fits, and its training loss vanishes over iterations. This happens because the training samples are not well separated, so the model energy degenerates into a sum of delta functions. When the model size is on the order of O(D^2) and the model is trained on 90M tokens, it achieves training and validation losses similar to those in the setting with 9B tokens. Two vanilla Transformers with 6 and 10 layers are trained using a batch size of 8, and their training losses stabilize at a value of around 1, as predicted by the proposition in the paper.
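For readers less familiar with the quantity being tracked, the following Python sketch computes a mean next-token cross-entropy loss and contrasts two extremes: a model that guesses uniformly over a 68-word vocabulary (the size mentioned above) and a model that has effectively memorized its targets, whose loss approaches zero. The function name, the toy logits, and the use of natural logarithms are assumptions for illustration; the article does not state the units of the reported loss.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy in nats (assumed units) from raw logits and target ids."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 68
targets = np.array([0, 5, 10, 67])

# A model with no preference: every token gets the same logit.
uniform_logits = np.zeros((len(targets), vocab_size))
loss_uniform = cross_entropy(uniform_logits, targets)

# A model that has memorized the targets: a large logit on each correct token.
confident_logits = np.zeros((len(targets), vocab_size))
confident_logits[np.arange(len(targets)), targets] = 20.0
loss_memorized = cross_entropy(confident_logits, targets)

print("uniform loss:", loss_uniform, "perplexity:", np.exp(loss_uniform))
print("memorized loss:", loss_memorized)
```

Under this convention, a loss that plateaus near 1 sits well below the uniform-guess value of log(68) but clearly above the near-zero loss of pure memorization.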

In conclusion, the researchers presented a theoretical framework focused on the memorization process and performance dynamics of transformer-based language models (LMs). In this paper, transformer-based networks are modeled using associative memory, and the cross-entropy loss is characterized in terms of model and data sizes. Experiments are carried out by (a) training GPT-2 on different data sizes and (b) training vanilla Transformer models on a dataset of 2M tokens. Finally, a global energy function is constructed for the layered structure of transformer models using the majorization-minimization technique.


Check out the Paper. All credit for this research goes to the researchers of this project.