A LAN is the network that connects client computers to servers, and most of us know that much. What fewer people know about is the second network: a scale-out network that sits behind the LAN and runs deep learning workloads, the ones that need thousands of GPUs to train.
NVIDIA, the most prominent GPU supplier, can connect those GPUs using InfiniBand, a technology it gained when it acquired Mellanox in 2020. The complication arises when cloud giants such as Google and Amazon get involved. These companies depend on cutting costs when they scale large networks, which in practice dictates that multiple vendors have to be involved, and that makes the situation more complicated.
As another player in the market, Broadcom launched its Tomahawk 5 (BCM78900 series) as its answer for interconnecting GPUs. It is an Ethernet switch chip with a switching bandwidth of 51.2 Tb/s (terabits per second), and it closes the latency gap between NVIDIA's InfiniBand devices and Ethernet-based switching devices. Latency is the time it takes for the first bit of data to travel from point A to point B. With that gap considerably reduced, this open, Ethernet-based approach gains an edge over NVIDIA. The technology that takes away InfiniBand's advantage is known as RoCE (RDMA over Converged Ethernet).
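To make the latency definition concrete, here is a minimal sketch of how first-bit latency and link bandwidth combine into the time needed to deliver a whole message. The function names, the 400 Gb/s link speed, and the 2 microsecond latency are illustrative assumptions, not figures from Broadcom or NVIDIA.

```python
def transfer_time_s(message_bytes: int, link_gbps: float, latency_us: float) -> float:
    """Time for a whole message to arrive: first-bit latency plus serialization time."""
    # Serialization time: how long the link needs to push every bit onto the wire.
    serialization_s = (message_bytes * 8) / (link_gbps * 1e9)
    return latency_us * 1e-6 + serialization_s


def latency_share(message_bytes: int, link_gbps: float, latency_us: float) -> float:
    """Fraction of the total transfer time that is spent on latency alone."""
    total_s = transfer_time_s(message_bytes, link_gbps, latency_us)
    return (latency_us * 1e-6) / total_s


if __name__ == "__main__":
    # Hypothetical link: 400 Gb/s with 2 microseconds of latency (assumed values).
    for size in (4096, 1_000_000):
        share = latency_share(size, link_gbps=400, latency_us=2.0)
        print(f"{size} bytes: latency is {share:.0%} of the transfer time")
```

Under these assumed numbers, latency dominates small messages and matters far less for large ones, which is why closing the latency gap is central to the comparison with InfiniBand.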
Broadcom points to multiple well-funded startups, as well as Google and Amazon, that want to build their own GPUs but do not have an InfiniBand fabric. According to Broadcom, as Ethernet latency continues to fall over time, the real weakness of InfiniBand will be exposed: scalability. According to the company, one of the main advantages RoCE has over InfiniBand is that it can also connect Intel and AMD CPUs, so collapsing the networking technology into a single approach has clear economic advantages. In the future, the market is expected to be split about 50-50 between CPU and GPU, because the same technology used for CPU interconnects will also be used for GPU interconnects. Another dynamic is that GPUs consume more bandwidth, while CPUs take up more ports on an Ethernet switch. Because more CPUs are sold than GPUs, this dynamic should even the two out.
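The bandwidth-versus-ports dynamic can be sketched with simple arithmetic. The example below reads the 51.2 figure quoted for the Tomahawk 5 as terabits per second and assumes, purely for illustration, that GPU-facing links run at 800 Gb/s and CPU-facing links at 200 Gb/s. The function names are hypothetical, and the calculation ignores any physical limit on port count.

```python
SWITCH_CAPACITY_GBPS = 51_200  # 51.2 Tb/s expressed in Gb/s


def max_ports(port_speed_gbps: int, capacity_gbps: int = SWITCH_CAPACITY_GBPS) -> int:
    """How many ports of a given speed fit into the switch's total bandwidth."""
    return capacity_gbps // port_speed_gbps


def bandwidth_used_gbps(port_count: int, port_speed_gbps: int) -> int:
    """Total bandwidth consumed by a group of identical ports."""
    return port_count * port_speed_gbps


if __name__ == "__main__":
    gpu_speed, cpu_speed = 800, 200  # assumed link speeds in Gb/s
    print("GPU-only switch:", max_ports(gpu_speed), "ports")
    print("CPU-only switch:", max_ports(cpu_speed), "ports")
    # An even split of bandwidth needs four times as many CPU ports as GPU ports.
    print("32 GPU ports use", bandwidth_used_gbps(32, gpu_speed), "Gb/s")
    print("128 CPU ports use", bandwidth_used_gbps(128, cpu_speed), "Gb/s")
```

With these assumed speeds, an even 50-50 split of the switch's bandwidth gives the CPU side four times as many ports as the GPU side, which matches the point that GPUs take bandwidth while CPUs take ports.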
In conclusion, there will be a market for this second network for deep learning workloads. Now that NVIDIA is no longer alone in it, it will be interesting to see how demand and cloud computing change with respect to GPUs and CPUs, as the cloud giants look for alternatives to InfiniBand to create their own GPUs. NVIDIA, which already makes state-of-the-art GPUs, remains a bit ahead of them.