Transformer-based models are among the most advanced classes of models in use today. Given their wide range of applications, such as text generation in natural language processing (NLP), text-to-image generation, and 3D protein structure prediction, they could plausibly bring about a paradigm shift in the rapidly developing field of AI. Large language models (LLMs) have been among the most successful applications of transformers, and their use has grown rapidly over the past few years as researchers explore larger and more sophisticated architectures. However, although these models are widely adopted, little is known about how and why they work so well. This is where understanding how LLMs evolve over the course of training comes into play. Prior research has shown that approximately regular patterns emerge as language models scale, but linking these patterns to how a model learns over the course of training is still largely uncharted territory. One of the primary reasons is the lack of publicly available LLMs that meet researchers' requirements.
To address this problem, EleutherAI, a non-profit AI research group, recently unveiled Pythia, a collection of 16 LLMs trained on public data seen in the same order and designed specifically to facilitate scientific research. According to the team, Pythia is currently the only publicly available model suite whose models were trained on the same data in the same order while spanning several orders of magnitude in scale. The team has released 154 checkpoints for each of the 16 models, which range in size from 70M to 12B parameters. All the corresponding data, along with the tools to download it and replicate the exact training process, are also publicly released to facilitate further research. These properties enabled the researchers behind Pythia to run experiments on how gender bias, memorization, and few-shot learning are affected by training data and model scale.
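As a rough illustration of how 154 checkpoints per model can be laid out, the sketch below enumerates one schedule that adds up to that number: an initial checkpoint at step 0, ten log-spaced early checkpoints (steps 1 through 512), and one checkpoint every 1,000 steps up to step 143,000. The schedule, the function names, and the `step<N>` label format are assumptions that are not stated in this article and should be checked against the Pythia repository before being relied on.

```python
def checkpoint_steps(final_step=143_000, interval=1_000, log_spaced_max=512):
    """Return the sorted training steps at which checkpoints are ASSUMED to exist.

    Assumed schedule (not taken from this article): step 0, powers of two
    up to `log_spaced_max`, then every `interval` steps up to `final_step`.
    """
    steps = {0}

    # Log-spaced early checkpoints: 1, 2, 4, ..., log_spaced_max.
    step = 1
    while step <= log_spaced_max:
        steps.add(step)
        step *= 2

    # Evenly spaced checkpoints: interval, 2 * interval, ..., final_step.
    steps.update(range(interval, final_step + 1, interval))
    return sorted(steps)


def checkpoint_labels(steps):
    """Format each step as a hypothetical 'step<N>' label."""
    return [f"step{s}" for s in steps]


steps = checkpoint_steps()
labels = checkpoint_labels(steps)
print(len(steps), labels[:4], labels[-1])
```

The dense early checkpoints matter because model behavior tends to change fastest at the start of training, while the evenly spaced ones make it easy to compare models of different sizes at the same point in the data.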
Before Pythia, there was no collection of models that was accessible to the general public, followed a well-established training process, and maintained uniformity across scales. This is the gap the Pythia researchers set out to fill. As noted above, all models are publicly accessible and were trained on the Pile, a collection of English-language data widely used to develop LLMs (particularly large autoregressive transformers). Pythia is designed so that all intermediate checkpoints are available for analysis, which makes it possible to link changes in a model's behavior to the data it had seen up to a particular checkpoint. The training process and the hyperparameters are also thoroughly documented to support future research.
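Because every model sees the same data in the same order, a checkpoint's step number can be converted into the amount of data consumed up to that point. The sketch below shows the arithmetic under an assumed batch of 1,024 sequences of 2,048 tokens each; these two numbers and the helper name `tokens_seen` are illustrative assumptions that do not come from this article, so the documented hyperparameters should be substituted where they differ.

```python
# Assumed batch shape (NOT stated in this article): verify against the
# published training configuration before relying on these values.
SEQUENCES_PER_BATCH = 1_024
TOKENS_PER_SEQUENCE = 2_048


def tokens_seen(step):
    """Return how many training tokens a model has consumed after `step` steps."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return step * SEQUENCES_PER_BATCH * TOKENS_PER_SEQUENCE


# Compare a few hypothetical checkpoints by the data they have seen.
for step in (0, 1_000, 143_000):
    print(f"step {step:>7,}: {tokens_seen(step):,} tokens")
```

With a fixed data order, the same step number corresponds to the same slice of the training set for every model size, which is what allows behavior at a checkpoint to be traced back to specific training data.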
EleutherAI's primary goal in developing Pythia is to empower future scientific research on understanding the capabilities and overcoming the limitations of large language models. To demonstrate Pythia's experimental methodology, the researchers focused on three case studies: mitigating gender bias, memorization in large language models, and the effect of term frequency on few-shot performance. From these experiments, the researchers concluded that this highly controlled setup can yield novel insights into LLMs and their training dynamics. They added that these case studies could not have been performed with any pre-existing model suite.
In conclusion, EleutherAI's Pythia is a collection of LLMs trained with consistent data ordering and model architecture that spans multiple orders of magnitude in scale. The accompanying research centers on three case studies, covering gender debiasing, memorization, and term frequency effects, which show how Pythia enables experiments at a level of detail previously unavailable for a public model suite. The researchers hope that their findings and analysis will stimulate further investigation into how language models change throughout training and how the patterns observed during training relate to model size.
Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development, and enjoys deepening her technical knowledge by participating in challenges.