Baidu Researchers Propose ‘HETERPS’ for Distributed Deep Learning with Reinforcement Learning-Based Scheduling in Heterogeneous Environments

Deep Neural Networks (DNNs) have achieved great success in several fields, including advertising systems, computer vision, and natural language processing. Large models with many layers, neurons, and parameters are typically trained on large amounts of data, which significantly boosts final accuracy. Click-Through Rate (CTR) prediction models, BERT, and ERNIE, for example, all use a large number of parameters; BERT alone has between 110 million and 340 million. Large models often consist of both data-intensive and compute-intensive layers. CTR models, for instance, handle high-dimensional input data.

The input data are high-dimensional and contain many sparse features. An embedding layer processes the small fraction of non-zero data to produce a low-dimensional embedding, referred to as light features. The embedding layer handles enormous volumes of data, such as 10 TB or even more, which results in high input/output (IO) costs and makes its processing data-intensive. Other layers, such as fully-connected layers, are compute-intensive and therefore expensive to train. As processing units such as CPUs, various kinds of GPUs, and AI processors grow more heterogeneous, it is essential to make full use of these heterogeneous computing resources for the distributed training of large-scale DNN models.
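To make the sparse-input point concrete, here is a minimal Python sketch of an embedding lookup. The table size, embedding width, feature IDs, and the `embed_sparse` helper are illustrative assumptions rather than details from the paper; the sketch only shows that the layer touches just the rows of the non-zero features and pools them into a small dense vector.

```python
import random

VOCAB_SIZE = 1_000   # hypothetical number of distinct sparse feature IDs
EMBED_DIM = 4        # hypothetical width of the dense embedding

rng = random.Random(0)
# Embedding table: one small dense row per sparse feature ID.
table = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed_sparse(nonzero_ids, table):
    """Sum-pool the embedding rows of the non-zero feature IDs only."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for fid in nonzero_ids:
        row = table[fid]          # a lookup: memory/IO-bound, little arithmetic
        for j in range(dim):
            pooled[j] += row[j]
    return pooled

# One sample in which only 3 of the 1,000 possible features are active.
sample = [7, 42, 913]
dense = embed_sparse(sample, table)
print(len(dense))  # 4
```

In a real CTR model the table would be far too large for one machine's memory, which is why this kind of layer is dominated by data access rather than computation.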

Some computing resources, such as CPUs, are better suited to data-intensive tasks, whereas others, such as GPUs, are better suited to compute-intensive tasks. For distributed training in this setting, scheduling tasks onto the different computing resources is crucial. Although the scheduling problem is a classic NP-hard problem, some simple solutions already exist. For instance, the first layer may be scheduled to CPUs because it typically deals with large volumes of data, while the remaining layers are scheduled to GPUs. This approach might not work for every DNN, since not all models have the same structure. Genetic and greedy algorithms can be applied directly to the layer scheduling problem, but they may fall into a local optimum, which corresponds to a high cost. Bayesian Optimization (BO)-based scheduling can also be used as a black-box optimization technique; however, BO may suffer from considerable randomness, which sometimes also leads to a high cost. Data parallelism is widely used to parallelize the training of large-scale DNN models, while pipeline parallelism is emerging as a promising way to handle large models. Once tasks have been assigned to the appropriate heterogeneous computing resources, parallelism can accelerate the training process.
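The simple "first layer on CPUs, the rest on GPUs" rule can be sketched in a few lines of Python. The layer names and the `first_layer_cpu_heuristic` helper are hypothetical; the second example shows why the rule may not transfer to a model whose data-intensive layer is not the first one.

```python
def first_layer_cpu_heuristic(layer_names):
    """Assign the first layer to CPUs and every later layer to GPUs."""
    return {name: ("cpu" if i == 0 else "gpu")
            for i, name in enumerate(layer_names)}

# A CTR-style model where the data-intensive embedding layer comes first.
ctr_plan = first_layer_cpu_heuristic(["embedding", "fc1", "fc2", "fc3"])
print(ctr_plan)
# {'embedding': 'cpu', 'fc1': 'gpu', 'fc2': 'gpu', 'fc3': 'gpu'}

# A hypothetical model with a different structure: the embedding layer is
# second, so the fixed rule sends it to GPUs regardless of its workload.
other_plan = first_layer_cpu_heuristic(["dense_input", "embedding", "fc1"])
print(other_plan["embedding"])  # gpu
```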

To achieve fine-grained parallelism, data parallelism and pipeline parallelism can be combined. With data parallelism, the training data is partitioned to match the number of computing resources, and each computing resource uses the same DNN model to process a separate portion of the data set. With pipeline parallelism, each computing resource processes the training data with one stage of the model, so the stages can run in parallel. A DNN stage comprises several consecutive layers, and two stages may have a data dependency, where one stage’s output serves as the input of the other.
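The two forms of parallelism can be illustrated with a small Python sketch. Both helpers (`shard_data`, `pipeline_schedule`) are hypothetical names, and the schedule is an idealized model in which every stage takes exactly one time step per micro-batch; it is meant to show the idea, not how Paddle-HeterPS implements it.

```python
def shard_data(samples, num_workers):
    """Data parallelism: give each worker its own slice of the samples."""
    return [samples[i::num_workers] for i in range(num_workers)]

def pipeline_schedule(num_stages, num_micro_batches):
    """Pipeline parallelism: list, per time step, the (stage, micro_batch)
    pairs that run concurrently. Stage s can start micro-batch m only after
    stage s-1 has finished it, so stage s handles micro-batch t - s at step t.
    """
    steps = []
    for t in range(num_stages + num_micro_batches - 1):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_micro_batches]
        steps.append(active)
    return steps

print(shard_data(list(range(10)), 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

for t, active in enumerate(pipeline_schedule(2, 3)):
    print(t, active)
# 0 [(0, 0)]
# 1 [(0, 1), (1, 0)]
# 2 [(0, 2), (1, 1)]
# 3 [(1, 2)]
```

In the middle time steps both stages are busy at once, which is where the pipeline gains its speed-up over running the stages strictly one after another.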

Parallelism shortens training time, but using many computing resources can raise the cost. The training procedure usually has a throughput constraint so that the DNN model can be trained in a reasonable amount of time. It is therefore advantageous to minimize the monetary cost while satisfying the throughput constraint. Because the number of computing resources can scale up or down on demand, this elasticity can be used to meet the throughput constraint while lowering the monetary cost. Choosing how many computing resources to use for distributed training is therefore crucial.
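The trade-off between cost and throughput can be sketched as a small provisioning calculation in Python. This is a deliberately simplified model and not the paper's cost model: it assumes that a stage's throughput grows linearly with the number of units assigned to it and that the pipeline runs at the speed of its slowest stage. The stage names, per-unit throughputs, and prices are made-up numbers.

```python
import math

# Hypothetical stages: (samples/second per unit, price per unit per hour).
STAGES = {
    "cpu_stage": (200.0, 0.5),
    "gpu_stage": (1500.0, 3.0),
}

def provision(stages, target_throughput):
    """Smallest unit count per stage that meets the throughput target,
    assuming throughput scales linearly with the number of units."""
    return {name: math.ceil(target_throughput / unit_tp)
            for name, (unit_tp, _price) in stages.items()}

def pipeline_throughput(plan, stages):
    """The pipeline is only as fast as its slowest stage."""
    return min(n * stages[name][0] for name, n in plan.items())

def hourly_cost(plan, stages):
    return sum(n * stages[name][1] for name, n in plan.items())

plan = provision(STAGES, target_throughput=3000.0)
print(plan)                               # {'cpu_stage': 15, 'gpu_stage': 2}
print(pipeline_throughput(plan, STAGES))  # 3000.0
print(hourly_cost(plan, STAGES))          # 13.5
```

Adding units beyond these counts would raise the cost without being needed for the target, while removing any unit would make that stage the bottleneck and violate the constraint.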

In this research, the authors propose the Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), which uses elastic heterogeneous computing resources to enable distributed training of large-scale DNNs. Paddle-HeterPS consists of three components: the DNN layer scheduling module, the data management module, and the distributed training module. The DNN layer scheduling module generates a scheduling plan and a provisioning plan. The scheduling plan assigns each layer to the appropriate type of computing resource, while the provisioning plan specifies how many computing resources of each type the distributed training process needs. The data management module manages the movement of data across servers or clusters, where a cluster is a set of interconnected computing resources.

The distributed training module parallelizes the model’s training process by combining data parallelism and pipeline parallelism. The scheduling module provides a DNN layer scheduling method that takes advantage of heterogeneous computing resources. Different layers in a DNN model can have different characteristics, such as being data-intensive or compute-intensive. An embedding layer is typically data-intensive, whereas a fully-connected layer is frequently compute-intensive because of its heavy processing workload. To shorten training time, the authors assign each layer to the appropriate type of computing resource, such as CPUs or GPUs. They then merge consecutive layers scheduled to the same type of computing resource into a single stage, which reduces the time spent transferring data between different computing resources. This produces the scheduling plan. Next, to balance the load and lower the cost while still satisfying the throughput constraint, they construct a provisioning plan that adjusts the number of computing resources of each type. Finally, they use pipeline and data parallelism to parallelize the training process.
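The step of merging neighbouring layers into stages can be sketched with Python's `itertools.groupby`, which groups consecutive items that share a key. The layer names, the per-layer assignments, and the `merge_into_stages` helper are illustrative assumptions; the paper's scheduler decides the assignments, whereas here they are simply given as input.

```python
from itertools import groupby

def merge_into_stages(schedule):
    """Merge consecutive layers assigned to the same resource type.

    schedule: list of (layer_name, resource_type) pairs in model order.
    Returns a list of (resource_type, [layer_names]) stages.
    """
    stages = []
    for resource, group in groupby(schedule, key=lambda item: item[1]):
        stages.append((resource, [layer for layer, _ in group]))
    return stages

schedule = [("embedding", "cpu"), ("fc1", "gpu"), ("fc2", "gpu"), ("fc3", "gpu")]
stages = merge_into_stages(schedule)
print(stages)
# [('cpu', ['embedding']), ('gpu', ['fc1', 'fc2', 'fc3'])]

# Data crosses resource types only at stage boundaries.
print(len(stages) - 1)  # 1
```

Without merging, each of the four layers would be its own stage with three hand-offs between them; after merging, only one transfer between CPUs and GPUs remains.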

The following is a summary of their key contributions:

• To enable the distributed training of large-scale DNNs with elastic heterogeneous computing resources, they present a framework called Paddle-HeterPS. The framework manages data storage and data communication among distributed computing resources.

• To schedule each layer to the appropriate type of computing resource while minimizing the total cost and ensuring the throughput, they present a reinforcement learning-based layer scheduling method. Based on the scheduling plan, they also provide a method to determine the appropriate number of computing resources for distributed training.

• They run extensive experiments on DNN models with diverse structures to demonstrate the advantages of their approach over baseline approaches.
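To give a feel for reinforcement learning-based layer scheduling, here is a toy policy-gradient (REINFORCE-style) sketch in Python. It is not the authors' implementation: the policy is just one independent CPU-or-GPU probability per layer, and the per-layer times, the transfer penalty, the cost function, and the hyperparameters are all made-up assumptions chosen so that the example runs quickly.

```python
import math
import random

# Hypothetical per-layer times (seconds per batch) on each resource type.
LAYER_TIMES = [
    {"cpu": 1.0, "gpu": 10.0},   # data-intensive layer, e.g. an embedding
    {"cpu": 10.0, "gpu": 1.0},   # compute-intensive layers
    {"cpu": 10.0, "gpu": 1.0},
    {"cpu": 10.0, "gpu": 1.0},
]
TRANSFER_PENALTY = 0.5  # hypothetical cost of each CPU<->GPU hand-off

def plan_cost(plan):
    """Total layer time plus a penalty for every change of resource type."""
    cost = sum(LAYER_TIMES[i][r] for i, r in enumerate(plan))
    cost += TRANSFER_PENALTY * sum(a != b for a, b in zip(plan, plan[1:]))
    return cost

def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-x))

def train_policy(steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(LAYER_TIMES)  # logit of P(layer is placed on GPU)
    baseline = None                    # running average cost, reduces variance
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        actions = [1 if rng.random() < p else 0 for p in probs]
        cost = plan_cost(["gpu" if a else "cpu" for a in actions])
        if baseline is None:
            baseline = cost
        advantage = cost - baseline
        baseline = 0.9 * baseline + 0.1 * cost
        # d/d(logit) of log-probability of the sampled action is (a - p);
        # step against it when the sampled plan cost more than the baseline.
        for i, (a, p) in enumerate(zip(actions, probs)):
            logits[i] -= lr * advantage * (a - p)
    return ["gpu" if sigmoid(t) > 0.5 else "cpu" for t in logits]

learned_plan = train_policy()
print(learned_plan, plan_cost(learned_plan))
```

The policy samples whole scheduling plans, observes their cost, and shifts probability toward cheaper assignments. A practical scheduler would need a richer policy and a cost model that also reflects the throughput constraint.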

Check out the paper and code. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.