Researchers from KAUST (King Abdullah University of Science and Technology) have found a new way to significantly increase the training speed of machine learning and deep learning models. A wide range of models can be trained much faster by detecting how often zero results are produced in distributed machine learning on large training datasets.
“Intelligence” in AI models is developed by training on labeled datasets that tell the model how to differentiate between inputs and respond accordingly. To get a better-performing model for a given task, we need to feed it more labeled data. Complex deep learning applications such as self-driving vehicles require extensive input datasets and long training times, even on powerful and very expensive parallel supercomputing platforms.
During training, small learning tasks are allocated to hundreds of computing nodes, which share their results over a communication network before running the next task. One of the most significant sources of overhead in such parallel computing is this communication between computing nodes at each model step.
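The exchange of results at every step can be pictured with a toy, single-process sketch. The names `local_gradient` and `all_reduce_sum` are hypothetical stand-ins for what a real training framework would provide, and no actual network is involved; the point is only that every worker must wait for the combined result before the next step begins.

```python
from typing import List


def local_gradient(worker_id: int, size: int) -> List[float]:
    """Hypothetical stand-in for the gradient one worker computes on its data shard."""
    # Even positions get a worker-specific value, odd positions stay zero.
    return [float(worker_id + 1) if i % 2 == 0 else 0.0 for i in range(size)]


def all_reduce_sum(gradients: List[List[float]]) -> List[float]:
    """Element-wise sum across workers; every worker would receive this same result."""
    return [sum(values) for values in zip(*gradients)]


def training_step(num_workers: int, size: int) -> List[float]:
    # Each worker computes its gradient independently...
    grads = [local_gradient(w, size) for w in range(num_workers)]
    # ...then all workers exchange results before the next step can start.
    return all_reduce_sum(grads)


if __name__ == "__main__":
    print(training_step(num_workers=4, size=6))
```

In a real cluster the summation step is where the network traffic occurs, which is why its cost grows in importance as models get larger.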
Jiawei Fei from the KAUST team explained that communication is a major performance bottleneck in deep learning. Along with the rapid increase in model size, the researchers also saw an increase in the proportion of zero values produced during the learning process, known as sparsity. The idea was to exploit this sparsity to increase effective bandwidth usage by sending only non-zero data blocks.
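As a rough illustration of what "sending only non-zero data blocks" could mean on the sender side, the sketch below splits a gradient into fixed-size blocks and keeps only those containing at least one non-zero value. The block size and the function names are assumptions made for this example, not details taken from the researchers' system.

```python
from typing import List, Tuple


def split_into_blocks(values: List[float], block_size: int) -> List[List[float]]:
    """Cut a flat list into consecutive blocks of at most block_size elements."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def nonzero_blocks(values: List[float], block_size: int) -> List[Tuple[int, List[float]]]:
    """Return (block_index, block) pairs for blocks holding at least one non-zero value."""
    blocks = split_into_blocks(values, block_size)
    return [(idx, block) for idx, block in enumerate(blocks)
            if any(v != 0.0 for v in block)]


def fraction_sent(values: List[float], block_size: int) -> float:
    """Share of blocks that would actually be transmitted."""
    total = len(split_into_blocks(values, block_size))
    return len(nonzero_blocks(values, block_size)) / total if total else 0.0


if __name__ == "__main__":
    gradient = [0.0, 0.0, 0.0, 0.0,
                0.5, 0.0, 0.0, 0.0,
                0.0, 0.0, 0.0, 0.0,
                0.0, 0.0, -1.2, 0.0]
    print(nonzero_blocks(gradient, block_size=4))
    print(fraction_sent(gradient, block_size=4))
```

In this example only two of the four blocks contain non-zero values, so half of the data would not need to cross the network.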
SwitchML is a system that KAUST developed earlier. It optimized internode communication by running aggregation code on the network switches that process data transfer. Fei, Marco Canini, and their colleagues went a step further by identifying zero results and developing a way to drop their transmission without disrupting the synchronization of the parallel computing process. The approach also offloads part of the computational load and significantly reduces the amount of data transmitted.
The research group proposes OmniReduce, a streaming aggregation system that sends only non-zero data blocks. This approach accelerates distributed training by up to 8x. Even at 100 Gbps, OmniReduce delivers 1.4–2.9 times better performance for network-bottlenecked DNNs than other systems.
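The receiving side of such a scheme can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not OmniReduce's actual protocol: each worker is assumed to send `(block_index, block)` pairs for its non-zero blocks only, and a hypothetical `aggregate` function sums them, treating any block nobody sent as all zeros.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, List[float]]  # (block index, block values)


def aggregate(streams: List[List[Block]], num_blocks: int, block_size: int) -> List[float]:
    """Sum the non-zero blocks received from all workers into one dense result."""
    sums: Dict[int, List[float]] = {}
    for stream in streams:              # one stream per worker
        for idx, block in stream:       # only non-zero blocks arrive
            acc = sums.setdefault(idx, [0.0] * block_size)
            for j, value in enumerate(block):
                acc[j] += value
    result: List[float] = []
    for idx in range(num_blocks):
        # Blocks that no worker sent are implicitly all zeros.
        result.extend(sums.get(idx, [0.0] * block_size))
    return result


if __name__ == "__main__":
    worker_a = [(1, [1.0, 2.0])]
    worker_b = [(1, [0.5, 0.0]), (2, [0.0, 3.0])]
    print(aggregate([worker_a, worker_b], num_blocks=3, block_size=2))
```

The result matches what a dense element-wise sum would give, even though block 0 was never transmitted by either worker.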
The team demonstrated OmniReduce on a testbed consisting of an array of graphics processing units (GPUs), where it achieved a substantial speed-up on typical deep learning tasks. The research team is now planning to adapt OmniReduce to run on programmable switches, using in-network computation to improve performance further.
Benefits of OmniReduce:
- Computational and space complexity are not affected by the number of nodes in a system. This allows OmniReduce to scale better than previous approaches, which were fundamentally limited because they masked aggregation latency with pipelining.
- The speed-up grows with the sparsity of the input data. At the same time, OmniReduce does not require sparse data to be beneficial. In the extreme case of dense data, OmniReduce can still provide results comparable to AllReduce.
- OmniReduce’s streaming aggregation algorithms can be deployed in several ways, depending on the type of input data. The aggregator component can run as an independent server resource (cheaper than worker nodes equipped with GPUs), be co-located on worker nodes, or run directly on network switches, as Mellanox SHARP, SwitchML and ATP do.
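The link between sparsity and acceleration in the second benefit can be illustrated with a back-of-the-envelope model. This is a toy estimate under strong assumptions, not a formula from the researchers: it assumes transfer time depends only on bytes sent, that each transmitted block carries a small index, and it ignores latency and computation. The function name and the default byte sizes are hypothetical.

```python
def estimated_speedup(nonzero_fraction: float,
                      block_bytes: int = 1024,
                      index_bytes: int = 4) -> float:
    """Ratio of dense transfer cost to block-sparse transfer cost (toy model)."""
    if not 0.0 < nonzero_fraction <= 1.0:
        raise ValueError("nonzero_fraction must be in (0, 1]")
    dense_cost = block_bytes                                  # every block is sent
    sparse_cost = nonzero_fraction * (block_bytes + index_bytes)  # non-zero blocks plus index
    return dense_cost / sparse_cost


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.25, 0.125):
        print(fraction, round(estimated_speedup(fraction), 2))
```

Under these assumptions, fully dense data gives a ratio just below 1 (the index overhead is small), while sparser data gives proportionally larger ratios.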