New AI Research from the University of Maryland Investigates Cramming Challenge for Training a Language Model on a Single GPU in One Day

In many areas of natural language processing, including language understanding and natural language generation, large-scale training of machine learning models based on transformer architectures has produced ground-breaking advances. A widely acknowledged property of these systems is that they scale stably: they continue to perform better as the number of model parameters and the volume of data increase.

While the majority of the studies are focused on finding new ways to push the boundaries of extreme computation, a team of researchers at the University of Maryland is looking into the best ways to scale back language model training and the trade-offs that may occur.

The power of scale has sparked a race to build enormously large models, and with it a belief among many researchers and practitioners that training a language model is out of their reach. The original BERT model is used in many real-world natural language processing applications, yet even this model required a substantial amount of compute to train.

If a language model can be trained to BERT’s performance level with relatively limited resources, this has a number of intriguing consequences. For one, if scaled-down pretraining is a viable counterpart to large-compute pretraining, it opens up a wide range of academic inquiries that are currently difficult to carry out for large-scale models. According to the researchers, there may also be scenarios in which a practitioner wants to retrain a language model on a specialized or trustworthy data source, for example because legal considerations make it unclear whether models trained on public data of questionable origin are acceptable.

The new study by researchers at the University of Maryland explores the “cramming” challenge: learning an entire language model the day before the test, that is, on a single GPU in one day. The study shows that performance closely follows the scaling laws observed in large-compute settings, even in this constrained setting. The research first examines various aspects of the training pipeline to determine which modifications improve performance in the scaled-down setting.

Scaling down is challenging. Smaller model architectures allow faster gradient computations, yet the overall rate of model improvement over time stays almost constant. However, modifications to the training recipe that take advantage of scaling laws can produce gains by increasing the effective rate of gradient computations without reducing the model size. Ultimately, the team was able to train models on a tight budget and deliver respectable performance, frequently approaching and occasionally even surpassing BERT on GLUE tasks.
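The trade-off between model size and gradient throughput can be illustrated with a rough back-of-the-envelope sketch. It relies on the common rule of thumb that training a transformer costs about 6 FLOPs per parameter per token; the sustained throughput figure and the model sizes below are hypothetical placeholders, not numbers reported in the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_in_budget(n_params, flops_per_second, seconds=SECONDS_PER_DAY):
    """Estimate how many training tokens fit in a fixed time budget.

    Uses the rule-of-thumb approximation:
        training FLOPs ~= 6 * parameters * tokens
    """
    total_flops = flops_per_second * seconds
    return total_flops / (6 * n_params)


# Hypothetical sustained throughput of a single GPU (an assumption for
# illustration only, not a measurement from the paper).
ASSUMED_FLOPS_PER_SECOND = 1e13

# Hypothetical model sizes, chosen only to show the trend.
for n_params in (25e6, 50e6, 110e6):
    tokens = tokens_in_budget(n_params, ASSUMED_FLOPS_PER_SECOND)
    print(f"{n_params / 1e6:.0f}M parameters -> about {tokens / 1e9:.1f}B tokens in one day")
```

Under this approximation, halving the parameter count doubles the number of tokens seen in a day. A smaller model takes more gradient steps but learns less from each token, which is consistent with the observation that shrinking the model alone does not speed up improvement over time.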

The team evaluates how a transformer-based language model performs when it is crammed into a setting with very little compute. They find that several strands of modification lead to respectable downstream performance on GLUE. The team hopes this work can serve as a starting point for investigations into the cramming question and shed additional light on a number of improvements and strategies.

Check out the Paper and Github. All credit for this research goes to the researchers on this project.