The transformer architecture has become a go-to choice for modeling data across many domains. Its inductive biases have proven empirically to scale well, which has led to a steady cycle of training and releasing larger versions of existing, smaller models. Although these new models are often just scaled-up versions of their smaller counterparts, they are normally trained from scratch. Since even the smallest models need a significant amount of computational resources to train, it makes sense to reuse the parameters of smaller pretrained models to speed up the training of larger ones.
From the perspective of model growth, one strategy is to use the pretrained parameters of a smaller model to initialize some of the parameters of the larger model. Recent research has shown that training can be accelerated by copying a subset of the pretrained parameters to initialize the new parameters and then fine-tuning the entire network. This contrasts with earlier work, which generally froze the parameters initialized from the pretrained model and trained only the new (randomly initialized) parameters.
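As a rough illustration of this kind of initialization, the sketch below copies a small pretrained weight matrix into the top-left block of a wider matrix and fills the remaining entries with small random values. The function name, the Gaussian scale, and the block layout are assumptions made for this example and are not taken from any particular paper.

```python
import random

def init_wider_from_small(small, new_rows, new_cols, scale=0.02, seed=0):
    """Build a new_rows x new_cols matrix whose top-left block is `small`.

    Entries outside the copied block are drawn from a small Gaussian
    (an assumed initialization, chosen only for illustration).
    """
    rng = random.Random(seed)
    old_rows, old_cols = len(small), len(small[0])
    if new_rows < old_rows or new_cols < old_cols:
        raise ValueError("target matrix must be at least as large as the source")
    # Start from random values everywhere, then overwrite the reused block.
    large = [[rng.gauss(0.0, scale) for _ in range(new_cols)] for _ in range(new_rows)]
    for i in range(old_rows):
        for j in range(old_cols):
            large[i][j] = small[i][j]  # reuse the pretrained value
    return large

small_w = [[1.0, 2.0], [3.0, 4.0]]      # stand-in for pretrained weights
large_w = init_wider_from_small(small_w, 4, 4)
```

Under the recent approach described above, every entry of `large_w` would then be fine-tuned. Under the earlier approach, the copied block would stay frozen and only the random entries would be trained.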
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) propose using smaller pretrained language models to make training larger ones cheaper and faster. Their approach uses machine learning to “grow” a larger, more complex model from a smaller one, so that the larger model encodes the knowledge the smaller model has already acquired. This allows the larger model to be trained more quickly. Rather than discarding old models, the team takes their best parts and uses them to create something new.
Compared to training a new model from scratch, their approach reduces the computational cost of training a large model by around 50%. In addition, models trained with the MIT method performed as well as or better than models trained with other techniques that also use smaller models to speed up the training of larger ones.
Faster training of large models could improve research efficiency, lower costs, and reduce the carbon emissions produced during training. It could also allow smaller research groups to access and work with these enormous models, which could pave the way for numerous new developments.
The proposed strategy is called the Learned Linear Growth Operator (LiGO). It expands a network’s width and depth starting from the parameters of a smaller network, in a data-driven way. The researchers use machine learning to learn a linear mapping of the smaller model’s parameters: the map takes the parameters of the smaller model as input and produces the parameters of the larger model as output.
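The idea of a linear map from small-model parameters to large-model parameters can be sketched with a toy matrix-vector product. The sizes and matrix entries below are hypothetical and hand-written for illustration. In LiGO the map is learned from data, which this sketch does not attempt.

```python
def apply_linear_map(matrix, small_params):
    """Return matrix @ small_params for a list-of-lists matrix and a flat list."""
    if any(len(row) != len(small_params) for row in matrix):
        raise ValueError("each row of the map must match the small parameter count")
    return [sum(m * p for m, p in zip(row, small_params)) for row in matrix]

# Hypothetical toy sizes: 3 small parameters mapped to 5 large parameters.
small_params = [0.5, -1.0, 2.0]

# A hand-written map for illustration. Rows 0-2 copy the small parameters,
# row 3 duplicates parameter 0, and row 4 averages parameters 1 and 2.
growth_map = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]

large_params = apply_linear_map(growth_map, small_params)
print(large_params)
```

Copying pretrained parameters into a larger model is itself a linear map of this kind, with only zeros and ones in the matrix. Learning the entries instead allows other combinations of the small model’s parameters.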
Researchers may want to build a model with a billion parameters, yet even the smaller model may be quite large (perhaps a hundred million parameters). A single linear map between parameter sets of that size would be far too big to learn directly, so the LiGO method breaks the map into smaller pieces that a machine-learning system can handle.
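Some quick arithmetic shows why a single unstructured map is unwieldy at these sizes. The second half of the sketch compares it with one possible structured alternative, in which a square weight matrix is widened as `A @ W_small @ B^T` using two thin matrices. That factored form and the hidden sizes are assumptions for illustration only. The article does not spell out how LiGO splits the map.

```python
def dense_map_entries(n_small, n_large):
    """Entries in one unstructured matrix mapping n_small inputs to n_large outputs."""
    return n_small * n_large

def factored_width_entries(d_small, d_large):
    """Entries in two d_large x d_small factors A and B used as A @ W_small @ B^T."""
    return 2 * d_large * d_small

# Parameter counts quoted in the article.
n_small, n_large = 100_000_000, 1_000_000_000
print(dense_map_entries(n_small, n_large))  # 10**17 entries

# Hypothetical hidden sizes for one square weight matrix.
d_small, d_large = 768, 1024
per_matrix_dense = dense_map_entries(d_small ** 2, d_large ** 2)
per_matrix_factored = factored_width_entries(d_small, d_large)
print(per_matrix_dense, per_matrix_factored)
```

Even for a single weight matrix, the structured version needs orders of magnitude fewer entries than an unstructured map over the flattened weights.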
According to the researchers, LiGO improves on alternative strategies because it expands width and depth at the same time. They also highlight that users can set the larger model’s width and depth to whatever they want when they supply the smaller model and its parameters.
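A minimal sketch of growing width and depth in one pass is shown below, under stated assumptions: each small layer is a single square weight matrix, width is expanded as `A @ W @ B^T`, and each large layer is a weighted mix of the widened small layers. The function names and all matrix values are hypothetical and fixed by hand. In LiGO the expansion would be learned, and the authors' exact formulation may differ.

```python
def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def grow_width(w, a, b):
    """Widen one square weight matrix: W_large = A @ W @ B^T."""
    return matmul(matmul(a, w), transpose(b))

def grow(small_layers, a, b, depth_mix):
    """Grow width and depth in one pass.

    small_layers: list of d_small x d_small matrices, one per small layer.
    a, b: d_large x d_small width-expansion matrices (fixed by hand here).
    depth_mix: one row per large layer, giving the weight on each small layer.
    """
    widened = [grow_width(w, a, b) for w in small_layers]
    d_large = len(a)
    large_layers = []
    for mix in depth_mix:
        # Each large layer is a weighted sum of the widened small layers.
        layer = [[0.0] * d_large for _ in range(d_large)]
        for coeff, w in zip(mix, widened):
            for i in range(d_large):
                for j in range(d_large):
                    layer[i][j] += coeff * w[i][j]
        large_layers.append(layer)
    return large_layers

# Hypothetical toy model: 2 layers of width 2, grown to 3 layers of width 3.
small_layers = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
expand = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # copy both units, add their average
depth_mix = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # middle layer blends both small layers

large_layers = grow(small_layers, expand, expand, depth_mix)
print(len(large_layers), len(large_layers[0]), len(large_layers[0][0]))  # 3 3 3
```

The target width is set by the number of rows in `expand` and the target depth by the number of rows in `depth_mix`, which mirrors the point that users can choose both dimensions of the larger model.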
Their solution outperformed all baselines, including training a new model from scratch and other model-growth approaches. It reduces the computational cost of training vision and language models by around 50%, in many cases with an improvement in performance. The team also found that LiGO could speed up transformer training even when no smaller pretrained model was available. They hope to apply LiGO to even larger models in the future.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.