Deep Learning for Large-Scale Biomolecular Dynamics: Harvard Researchers Scale a Large, Pretrained Allegro Model Across Diverse Systems

Computational biology, chemistry, and materials engineering rely on the ability to anticipate the time evolution of matter at the atomic scale. While quantum mechanics governs the vibrations, migration, and bond dissociation of atoms and electrons at tiny scales, the phenomena behind observable physical and chemical processes often unfold at considerably larger length and time scales. Bridging these scales requires innovation on two fronts: fast, highly accurate computational methods for capturing quantum interactions, and highly parallelizable architectures that can exploit exascale hardware. Current computational approaches cannot probe the structural complexity of realistic physical and chemical systems, whose observable evolution plays out over durations far beyond the reach of atomistic simulation.

Machine learning interatomic potentials (MLIPs) have been an active research area for the past two decades. MLIPs learn energies and forces from high-precision reference data and scale linearly with the number of atoms. The earliest attempts paired a Gaussian process or a simple neural network with manually crafted descriptors. These early MLIPs had poor predictive accuracy because they could not generalize to structures absent from the training data, leading to fragile simulations that did not transfer to new systems.
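The core idea behind any MLIP can be sketched in a few lines: a model maps atomic positions to a total energy, and forces come out as the negative gradient of that energy. The toy sketch below uses a harmonic pair potential as the "model" for illustration; a real MLIP such as Allegro replaces it with a deep equivariant network, but the energy-to-force relationship is the same. The function names and parameters here are illustrative, not from the paper.

```python
import numpy as np

def energy(positions, a=1.0, r0=1.5):
    """Toy energy model: E = sum over pairs of a * (|r_i - r_j| - r0)^2."""
    e = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(positions[i] - positions[j])
            e += a * (d - r0) ** 2
    return e

def forces(positions, a=1.0, r0=1.5):
    """Forces F = -dE/dr, via the analytic gradient of each pair term."""
    f = np.zeros_like(positions)
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            rij = positions[i] - positions[j]
            d = np.linalg.norm(rij)
            g = 2 * a * (d - r0) * rij / d  # dE/dr_i for this pair
            f[i] -= g                        # F = -dE/dr
            f[j] += g                        # equal and opposite on atom j
    return f

# Two atoms stretched past the equilibrium distance r0:
pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(energy(pos))   # 0.25
print(forces(pos))   # equal and opposite forces pulling the atoms together
```

Training an MLIP then amounts to fitting the energy model's parameters so that predicted energies and forces match the quantum-mechanical reference data, while linear scaling with atom count comes from restricting interactions to local neighborhoods.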

New research from the Harvard lab demonstrates that biomolecular systems with as many as 44 million atoms can be modeled with state-of-the-art precision using Allegro. The team applied a large, pretrained Allegro model to systems spanning a wide range of sizes: roughly 23,000 atoms for DHFR, 91,000 for Factor IX, 400,000 for cellulose, more than 100,000 for other systems, and 44 million for the HIV capsid. The pretrained Allegro model has 8 million weights and achieves a force error of only 26 meV/Å, trained on 1 million structures at hybrid-functional accuracy from the SPICE dataset. Learning across inorganic materials and organic molecules at this data scale opens the door to fast exascale simulations of previously unreachable swaths of material systems.

To enable active learning for the automatic construction of training sets, the researchers showed that the uncertainty of deep equivariant model predictions of energies and forces can be quantified efficiently. Because equivariant models are so accurate, the accuracy bottleneck now lies in the quantum electronic-structure calculations required to train MLIPs. Since Gaussian mixture models can be fitted directly on top of Allegro, large-scale uncertainty-aware simulations become possible with a single model instead of an ensemble.
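The intuition behind mixture-based uncertainty is density estimation on the model's learned features: atoms whose features are unlikely under a distribution fitted to the training data are flagged as uncertain. The minimal sketch below uses a single Gaussian component, the one-component special case of a mixture, fitted with plain NumPy; the feature vectors are random stand-ins, not Allegro outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-atom feature vectors collected during training.
train_feats = rng.normal(0.0, 1.0, size=(500, 8))

# Fit a single Gaussian (one-component "mixture") to the training features.
mu = train_feats.mean(axis=0)
cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(8)  # regularized
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)

def neg_log_likelihood(x):
    """Higher value = features less like training data = more uncertain."""
    d = x - mu
    return 0.5 * (d @ cov_inv @ d + logdet + len(mu) * np.log(2 * np.pi))

in_dist = rng.normal(0.0, 1.0, size=8)    # resembles the training data
out_dist = rng.normal(5.0, 1.0, size=8)   # far from the training data

# The out-of-distribution atom scores as far more uncertain:
print(neg_log_likelihood(in_dist) < neg_log_likelihood(out_dist))  # True
```

In an active-learning loop, atoms scoring above an uncertainty threshold would be sent for new electronic-structure calculations and added to the training set; a full Gaussian mixture simply replaces the single component with several.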

Allegro stands out as a scalable approach that surpasses traditional message-passing and transformer-based designs. Across various large systems, the team reports top speeds of over 100 steps/second, with results scaling to more than 100 million atoms. Even at the 44-million-atom scale of the HIV capsid, where errors typically become far more apparent, the simulations are stable over nanoseconds out of the box, and the team encountered virtually no failures during production runs.

By opening a window onto the dynamics of large biomolecular systems and the atomic-level interactions between proteins and drugs, the team hopes their work will pave the way for new avenues in biochemistry and drug discovery.

Check out the Paper.

