Open-source hardware and software have become indispensable as industries race to scale their innovations, especially in AI. Almost every industry is adopting AI in the products and services it offers. As artificial intelligence (AI) and machine learning (ML) models become more advanced and capable, they will require more powerful hardware to keep up.
Meta is releasing open innovations to help solve industry-wide challenges and advance AI to better equip the data centers for the industry’s massive AI workloads. This includes new platforms for training and running AI models, power and rack innovations to help data centers handle AI more efficiently, and new developments with PyTorch.
The rising power demands of AI hardware and the resulting need for advances in liquid cooling are the main drivers behind Meta’s efforts to reevaluate every aspect of its platform, rack, power, and data center design.
Building on its earlier Zion-EX platform, Meta is introducing Grand Teton, its next-generation GPU-based hardware platform. Grand Teton has been built with more compute capacity to better support memory-bandwidth-bound workloads at Meta. Its expanded compute power envelope also makes it well suited to compute-bound workloads such as content understanding.
The original Zion platform has three separate pieces—the CPU head node, the switch sync system, and the GPU system—and relies on external cables to link them together. Grand Teton consolidates these functions into a single chassis, improving overall performance, signal integrity, and thermal performance.
Grand Teton’s rapid scalability and improved dependability are made possible by its high degree of integration, which makes its rollout easier and reduces the number of moving parts. This makes it ideal for inclusion in data center fleets.
Meta’s latest Open Rack hardware release brings a standard rack and power architecture to the market. Open Rack v3 (ORV3) was created with scalability in mind, with a frame and power infrastructure that can accommodate a wide variety of use cases, including Grand Teton, to help bridge the gap between current and future data center requirements.
The ORV3 power shelf is not bolted to the busbar. Instead, it can be installed anywhere in the rack where there is space. Its 48VDC output will support the higher power transmission requirements of future AI accelerators, and multiple shelves can be installed on a single busbar to support 30kW racks.
The battery backup unit has also been upgraded, increasing its runtime to 4 minutes from 90 seconds in the previous model and providing 15 kilowatts of power per shelf. This backup unit, like the power shelf, can be installed wherever you like within the rack for maximum flexibility, and when used in pairs, it can supply 30kW.
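The power figures quoted above lend themselves to a quick back-of-the-envelope check. The sketch below is illustrative only: the constants come from this article, while the helper names (`shelves_needed`, `energy_kwh`) are hypothetical, and the energy figure assumes a shelf delivers its full rated power for the entire runtime, which the article does not confirm.

```python
import math

# Figures quoted in the article.
BBU_POWER_KW = 15.0      # power per battery backup shelf
BBU_RUNTIME_MIN = 4.0    # runtime of the ORV3 backup unit
PREV_RUNTIME_S = 90.0    # runtime of the previous model


def shelves_needed(rack_load_kw: float, shelf_kw: float = BBU_POWER_KW) -> int:
    """Smallest number of shelves whose combined rating covers the rack load."""
    return math.ceil(rack_load_kw / shelf_kw)


def energy_kwh(power_kw: float, runtime_min: float) -> float:
    """Energy delivered, ASSUMING constant rated power for the whole runtime."""
    return power_kw * runtime_min / 60.0


# A 30kW rack needs a pair of 15kW shelves, matching the article.
print(shelves_needed(30.0))

# Energy per shelf under the constant-rated-power assumption, in kWh.
print(energy_kwh(BBU_POWER_KW, BBU_RUNTIME_MIN))

# Ratio of the new runtime (4 minutes) to the old runtime (90 seconds).
print(BBU_RUNTIME_MIN * 60.0 / PREV_RUNTIME_S)
```

Under these assumptions, a pair of shelves covers a 30kW rack, and the new unit runs roughly 2.7 times as long as its predecessor.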
Meta chose to develop nearly every component of the ORV3 design through the Open Compute Project (OCP) from the start. An ecosystem-led design process can take longer than a traditional in-house one, but the result is a comprehensive infrastructure solution that is scalable, flexible, and interoperable across a wide range of providers.
Thermal management grows more complicated as socket power increases. The ORV3 ecosystem was designed to support multiple liquid cooling strategies, including air-assisted liquid cooling and facility water cooling. Its optional blind-mate liquid cooling interface provides dripless connections between the IT hardware and the liquid manifold, which makes IT hardware easier to install and service.
According to one of the Meta team’s articles, the company’s next-generation storage platform, Grand Canyon, will soon offer enhanced hardware security and upgrades to key commodity components.
In September this year, Meta also announced that PyTorch would move to the newly created PyTorch Foundation, under the Linux Foundation’s wing. The foundation will support PyTorch through events such as conferences and educational workshops. It aims to increase the use of AI tools across industries by supporting a community of vendor-agnostic PyTorch projects.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.