Meta AI Builds A Massive AI Research Supercomputer To Advance The Field Of Machine Learning And Data Science Research

In recent years, Meta has made significant contributions to the fields of self-supervised learning and transformer-based models. Many of these AI models have revolutionized domains such as computer vision, natural language processing, and automatic speech recognition. Improving and training such models, which carry massive numbers of parameters and must exploit the vast amounts of data generated daily, requires new secure and reliable AI supercomputers capable of cutting training time dramatically.

To address the massive computational power needed to train large AI models, Meta has partnered with NVIDIA to develop advanced supercomputers capable of running quadrillions of operations per second. Together, Meta and NVIDIA have built the AI Research SuperCluster (RSC), already one of the fastest AI supercomputers in the world, and expected to be the fastest once its build-out completes in mid-2022. Researchers at Meta have already started using the RSC to train state-of-the-art AI models with trillions of parameters.

At present, the supercomputer uses 760 NVIDIA DGX A100 systems as its compute nodes. Each system contains eight NVIDIA A100 GPUs, and the nodes are linked over an NVIDIA Quantum 200 Gb/s InfiniBand fabric. By the time the RSC is complete, the total number of GPU endpoints is planned to grow from 6,080 to 16,000, making it one of the largest such networks deployed to date, backed by a storage system capable of serving 16 TB/s of training data. Compared with Meta's previous-generation research cluster, built in 2017 on 22,000 NVIDIA V100 Tensor Core GPUs, the currently deployed RSC runs computer vision pipelines around 20 times faster, runs NVIDIA's NCCL collective communications library more than nine times faster, and trains emerging large-scale NLP models three times faster.
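The scale figures above are easy to sanity-check. A minimal sketch, using only the per-system GPU count and the phase-two target quoted in this article:

```python
# Back-of-envelope check of the RSC scale figures quoted above.
DGX_SYSTEMS_PHASE1 = 760      # DGX A100 systems currently deployed
GPUS_PER_SYSTEM = 8           # A100 GPUs per DGX A100 system
GPUS_PHASE2_TARGET = 16_000   # planned GPU endpoints at completion

gpus_phase1 = DGX_SYSTEMS_PHASE1 * GPUS_PER_SYSTEM
print(gpus_phase1)            # 6080, matching the current endpoint count

scale_up = GPUS_PHASE2_TARGET / gpus_phase1
print(f"{scale_up:.2f}x")     # ~2.63x more GPUs after the build-out
```

The roughly 2.6x growth in GPU count lines up with the planned jump from 6,080 to 16,000 endpoints.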

Meta’s RSC has privacy and security at the core of its design. Meta wanted a supercomputing platform capable of training AI models on the data users upload to its services every day. To achieve this, the RSC was designed so that data remains anonymized and encrypted at every stage of the pipeline up until model training.

The applications of such an AI supercomputing platform are immense. Researchers at Meta will be able to train very large-scale AI models with trillions of parameters, learning securely from the data uploaded to its services daily. Applications include processing high-quality, high-frame-rate video, real-time text translation, development of new AR platforms, and real-time language processing that would allow people to communicate seamlessly across languages.

By the middle of 2022, when Meta and NVIDIA finish building out the RSC platform, it is expected to be the fastest AI supercomputer in the world, delivering nearly five exaflops of mixed-precision compute, with an overall 2.5x improvement in AI training performance. The RSC will support the future development of the metaverse and will also enable scientists from the broader AI community to carry out groundbreaking research.
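The five-exaflops figure is consistent with the planned GPU count. A rough sketch, assuming a per-A100 peak of about 312 teraflops for dense mixed-precision math (an assumption based on the A100's published specifications; sustained real-world throughput is lower):

```python
# Estimate the peak mixed-precision throughput of the completed RSC.
A100_PEAK_TFLOPS = 312   # assumed per-GPU peak, dense mixed precision
TOTAL_GPUS = 16_000      # planned GPU endpoints at completion

peak_exaflops = TOTAL_GPUS * A100_PEAK_TFLOPS * 1e12 / 1e18
print(f"{peak_exaflops:.2f} exaflops")  # ~4.99, i.e. roughly five exaflops
```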

References:

  • https://ai.facebook.com/blog/ai-rsc/
  • https://blogs.nvidia.com/blog/2022/01/24/meta-ai-supercomputer-dgx/

Archishman Biswas is currently a fourth-year undergraduate pursuing a Dual Degree in Electrical Engineering at the Indian Institute of Technology, Bombay. His specialization is in Communication Systems and Signal Processing. He is interested in recent and emerging Deep Learning architectures and is enthusiastic about exploring the applications of Deep Learning and other AI techniques in the fields of Image Processing and Computer Vision.