Horovod has announced v0.21, bringing many powerful new features to the Horovod community and making training deep learning models faster and easier than ever before.
Horovod was open-sourced in 2017, and it has grown to become the standard solution for scaling deep learning training to hundreds of GPUs. Horovod can reduce training times from days or weeks to hours or minutes by adding a few lines of Python code to an existing TensorFlow, PyTorch, or Apache MXNet training script.
- Local gradient aggregation, an optimization for TensorFlow v1 and v2, contributed by Determined AI.
- Grouped allreduce, which reduces latency and improves determinism, contributed by Nvidia.
- Easy provisioning of Elastic Horovod jobs on Ray, contributed by Anyscale.
- Support for Horovod Spark Estimators in the Databricks Runtime, contributed by Databricks.
Local Gradient Aggregation
Local gradient aggregation is a method for reducing communication overhead in network-bandwidth-constrained settings, where GPUs can process batches much faster than the network can transmit gradients for aggregation across workers. It works by accumulating gradient updates locally in GPU memory, summing each new update into the updates accumulated since the last round of communication. After a configured number of mini-batches, N, Horovod performs an allreduce to average the accumulated gradients across workers over the network, then applies the result to the model.
Local gradient aggregation is similar to training with a larger effective batch size. However, it does not directly increase the batch size; it achieves an effective batch size increase that is not limited by available GPU memory. Each worker performs more computation per unit of communication, reducing communication overhead by a factor of N.
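The accumulate-then-communicate logic can be sketched in plain Python. This is a conceptual illustration only, not Horovod's actual implementation (in Horovod, local gradient aggregation is enabled via the `backward_passes_per_step` argument to `hvd.DistributedOptimizer`); the gradient values below are hypothetical.

```python
# Conceptual sketch of local gradient aggregation (illustrative only).
# Each worker sums N consecutive micro-batch gradients locally in memory;
# only the summed result is sent over the network for allreduce.

def aggregate_locally(gradient_batches, n):
    """Sum every n consecutive gradient updates before communicating."""
    aggregated = []
    accumulator = None
    for step, grad in enumerate(gradient_batches, start=1):
        if accumulator is None:
            accumulator = list(grad)
        else:
            accumulator = [a + g for a, g in zip(accumulator, grad)]
        if step % n == 0:
            # One allreduce here instead of n separate ones.
            aggregated.append(accumulator)
            accumulator = None
    return aggregated

# Hypothetical per-step gradients for a model with two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
# With N=2, four training steps trigger only two communication rounds:
print(aggregate_locally(grads, 2))  # [[4.0, 6.0], [12.0, 14.0]]
```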
Grouped Allreduce

Grouped allreduce is a method for optimizing Horovod's performance when training at supercomputing scale. In this release, Nvidia has contributed a general API that brings the benefits of grouped allreduce to the open-source Horovod community.
Grouped allreduce gives users direct control over how Horovod groups tensors for allreduce. When a list of tensors is passed to hvd.grouped_allreduce, Horovod treats the list logically as a single request, and the backend processes it only once all tensors in the list are available.
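The mechanism grouping enables can be sketched without Horovod: several small tensors are fused into one flat buffer so that a single collective operation replaces many small ones, then the result is split back. This is a simplified stand-in for what Horovod's C++ backend does when it services a grouped request; the tensor values are hypothetical.

```python
# Illustrative sketch of tensor fusion, the optimization that grouped
# allreduce makes controllable. Horovod's backend performs this in C++;
# here plain Python lists stand in for 1-D tensors.

def fuse(tensors):
    """Flatten a list of 1-D tensors into one buffer plus split sizes."""
    sizes = [len(t) for t in tensors]
    buffer = [x for t in tensors for x in t]
    return buffer, sizes

def unfuse(buffer, sizes):
    """Split a fused buffer back into the original tensor shapes."""
    out, offset = [], 0
    for s in sizes:
        out.append(buffer[offset:offset + s])
        offset += s
    return out

tensors = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
buf, sizes = fuse(tensors)
# A single allreduce would run on `buf`; the result is then split back:
print(unfuse(buf, sizes))  # [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
```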
Elastic Horovod on Ray
In v0.20, Horovod introduced Elastic Horovod, an auto-scaling and fault-tolerant API for TensorFlow and PyTorch. In v0.21, such jobs can be launched on preemptible cloud instances with just a few lines of code using Horovod on Ray's ElasticRayExecutor. This API brings fault-tolerant, auto-scaling distributed training to an existing Ray cluster: with the ElasticRayExecutor, one can safely train on preemptible instances that may join or leave the cluster at any point during training.
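A minimal launch skeleton, assuming a running Ray cluster and following the ElasticRayExecutor API as documented for Horovod v0.21 (the training function body is a placeholder, not a complete script):

```python
import ray
import horovod.torch as hvd
from horovod.ray import ElasticRayExecutor

# Connect to an existing Ray cluster (assumes one is already running).
ray.init(address="auto")

def training_fn():
    hvd.init()
    # ... build the model, wrap the optimizer with
    # hvd.DistributedOptimizer, and run the usual Elastic
    # Horovod training loop with state synchronization ...

# Create elastic settings and an executor; workers are added or
# removed automatically as preemptible instances come and go.
settings = ElasticRayExecutor.create_settings(verbose=True)
executor = ElasticRayExecutor(settings, use_gpu=True, cpus_per_slot=2)
executor.start()
results = executor.run(training_fn)
```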
Horovod Spark Estimators with Databricks
Horovod Spark Estimators enable one to train a deep learning model with Horovod as part of any PySpark Pipeline. In v0.21.0, Databricks has added support for running Horovod Spark Estimators in the Databricks Runtime for ML environment (AWS | Azure).
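A sketch of what using a Horovod Spark Estimator looks like, based on the horovod.spark Keras API; the DataFrame, column names, model definition, and storage path are all hypothetical placeholders:

```python
from pyspark.sql import SparkSession
import tensorflow as tf
import horovod.spark.keras as hvd_keras
from horovod.spark.common.store import Store

spark = SparkSession.builder.getOrCreate()
# Hypothetical training DataFrame with 'features' and 'label' columns.
train_df = spark.read.parquet("/data/train.parquet")

# Store for intermediate training data and checkpoints (path is illustrative).
store = Store.create("/tmp/horovod-work")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# The estimator runs Horovod-distributed training as a pipeline stage.
keras_estimator = hvd_keras.KerasEstimator(
    num_proc=4,
    store=store,
    model=model,
    optimizer=optimizer,
    loss="mse",
    feature_cols=["features"],
    label_cols=["label"],
    batch_size=128,
    epochs=10,
)

keras_model = keras_estimator.fit(train_df)
predictions = keras_model.transform(train_df)
```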
The next major milestone for the Horovod project is v1.0, which will aim to solidify the core API and the newly introduced Elastic Horovod API. Major priorities include:
- Higher-level API to simplify Elastic training and data loading.
- Feature parity between all supported frameworks: TensorFlow, PyTorch, and MXNet.
- Improved error handling, debuggability, and messaging.
- Slow worker mitigation (removal) with Elastic Horovod.