Deep learning’s success rests on its capacity to tackle enormous non-convex optimization problems with relatively simple tools. Even though non-convex optimization is NP-hard in general, simple, generic variants of stochastic gradient descent fit massive neural networks surprisingly well in practice. In the paper “Git Re-Basin,” the authors conclude that, after accounting for all possible permutation symmetries of hidden units, neural network loss landscapes contain (almost) a single basin. They propose three strategies for permuting the hidden units of one model to align them with the units of a reference model. This transformation yields functionally equivalent weights that lie in an approximately convex basin near the reference model.
They study the unreasonable effectiveness of stochastic gradient descent (SGD) algorithms on deep learning’s high-dimensional non-convex optimization problems. Their work is driven mainly by three questions:
1. Why does SGD succeed in optimizing high-dimensional non-convex deep learning loss landscapes while being substantially less reliable in other non-convex settings such as policy learning, trajectory optimization, and recommender systems?
2. Where are all the local minima? Why does the loss decrease gradually and monotonically when linearly interpolating between the initialization and the final trained weights?
3. Why do two models trained independently, with different random initializations and data batch orderings, almost always achieve similar performance? Why do their training loss curves look so alike?
This third point, they believe, implies the presence of some as-yet-unidentified invariance(s) in the training dynamics, such that independent training runs exhibit nearly identical characteristics. Neural networks are known to have permutation symmetries of hidden units: in a nutshell, one may swap any two units of a hidden layer, and the network will compute the same function as long as the weights are adjusted accordingly. Recent work has proposed that these permutation symmetries make it possible to connect points in weight space linearly without any loss penalty.
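The permutation symmetry described above is easy to verify numerically. The sketch below (a toy two-layer MLP with made-up random weights, not the paper’s code) permutes the hidden units with a permutation matrix P, adjusting the incoming weights and bias as well as the outgoing weights, and shows the output is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-layer MLP: x -> relu(W1 @ x + b1) -> W2 @ h + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def forward(W1, b1, W2, b2, x):
    h = np.maximum(W1 @ x + b1, 0.0)  # hidden activations
    return W2 @ h + b2

# Permutation matrix for the 4 hidden units.
P = np.eye(4)[[2, 0, 3, 1]]
x = rng.normal(size=3)

y_original = forward(W1, b1, W2, b2, x)
# Permute rows of W1/b1 (incoming) and columns of W2 (outgoing):
# relu(P W1 x + P b1) = P relu(W1 x + b1), and (W2 P^T)(P h) = W2 h.
y_permuted = forward(P @ W1, P @ b1, W2 @ P.T, b2, x)

print(np.allclose(y_original, y_permuted))  # True
```

Since every hidden layer of width n admits n! such permutations, a single trained network corresponds to a combinatorially large family of weight settings that all compute the same function.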
The hypothesis is that most SGD solutions belong to a set whose elements can be permuted into one another, so that linear interpolation between any two permuted elements encounters no loss barrier. Such solutions are referred to as linear mode connected (LMC). In this study, the authors try to identify the invariances behind these three phenomena and behind the remarkable success of SGD in deep learning. If correct, the hypothesis would significantly broaden our understanding of how SGD works in deep learning and provide a plausible explanation for these three phenomena in particular.
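The notion of a loss barrier along the interpolation path can be sketched as follows. This is a minimal illustration with a hypothetical `loss_fn` over flattened weight vectors, not the paper’s implementation; a barrier near zero is what “linear mode connectivity” means:

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, num_points=25):
    """Max increase in loss along the straight line between two weight
    vectors, relative to the mean of the endpoint losses. A value near
    zero indicates linear mode connectivity (LMC)."""
    lambdas = np.linspace(0.0, 1.0, num_points)
    path_losses = np.array(
        [loss_fn((1 - lam) * theta_a + lam * theta_b) for lam in lambdas]
    )
    endpoint_mean = 0.5 * (path_losses[0] + path_losses[-1])
    return float(np.max(path_losses) - endpoint_mean)

# Toy check with a convex quadratic loss: any two points are trivially
# linearly connected, so the barrier is zero.
quadratic = lambda theta: float(theta @ theta)
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(loss_barrier(quadratic, a, b))  # 0.0
```

For real networks the interesting case is the opposite direction: two independently trained solutions typically show a large barrier until one of them is permuted into alignment with the other.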
The following are their contributions:
1. Matching methods. They propose three distinct methods, built on ideas and techniques from combinatorial optimization, for aligning the weights of two independently trained models. They establish complexity results for these problems and provide approximation algorithms where appropriate. Their fastest method identifies permutations in seconds on current hardware.
2. Relationship to SGD. Using a counterexample, they show that linear mode connectivity is an emergent property of SGD training rather than of model architecture alone. They relate this finding to previous research on SGD’s implicit biases.
3. Experiments with realistic architectures, including zero-barrier LMC for ResNets: studies using MLPs, CNNs, and ResNets trained on MNIST, CIFAR-10, and CIFAR-100.
They demonstrate the single-basin phenomenon experimentally across a wide range of model architectures and datasets, including the first (to their knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100.
In addition, they uncover interesting relationships between model width, training time, and mode connectivity across a wide range of models and datasets, and they explore the limitations of the single-basin theory, including a counterexample to the linear mode connectivity hypothesis. Finally, they show that their methods can merge models trained on disjoint datasets into a single merged model that outperforms both input models while using no more compute or memory than either one.
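The core idea behind weight matching can be sketched as a linear assignment problem (LAP): choose the permutation of model B’s hidden units that is most similar to model A’s units, then permute B before averaging. The single-hidden-layer example below is an illustrative simplification (with synthetic weights and SciPy’s `linear_sum_assignment`), not the paper’s full multi-layer algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Reference model A, and a model B that is an exact hidden-unit
# permutation of A (single hidden layer of width 8, weights only).
W1_a, W2_a = rng.normal(size=(8, 5)), rng.normal(size=(3, 8))
true_perm = rng.permutation(8)
W1_b, W2_b = W1_a[true_perm], W2_a[:, true_perm]

# Unit-similarity matrix: entry (i, j) is the inner product between
# unit i of A and unit j of B, over incoming and outgoing weights.
similarity = W1_a @ W1_b.T + W2_a.T @ W2_b  # shape (8, 8)

# Solve the LAP for the highest-total-similarity matching.
row_ind, col_ind = linear_sum_assignment(similarity, maximize=True)

# col_ind[i] is the unit of B matched to unit i of A; permute B back.
W1_b_aligned, W2_b_aligned = W1_b[col_ind], W2_b[:, col_ind]

# After alignment, averaging the two models' weights becomes meaningful.
W1_merged = 0.5 * (W1_a + W1_b_aligned)

print(np.allclose(W1_b_aligned, W1_a))  # True: the permutation is undone
```

Here B is a noiseless permuted copy of A, so the matching is recovered exactly; for independently trained models the same machinery is used to find the best approximate alignment before interpolating or merging.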
Ainsworth, S. K., Hayase, J., & Srinivasa, S. (2022). Git Re-Basin: Merging Models modulo Permutation Symmetries. arXiv. https://doi.org/10.48550/arXiv.2209.04836
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is in image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.