Horovod: Uber’s Open Source Distributed Deep Learning Framework

Source: https://github.com/horovod/horovod

‘Horovod’ is an open-source distributed deep learning framework created by Uber’s AI team. It supports TensorFlow, Keras, PyTorch, and Apache MXNet.

The goal of ‘Horovod’ is to make distributed deep learning fast and easy to use: take a training script written for a single GPU and scale it to train across many GPUs in parallel. This has two aspects:

  1. How many changes does one have to make to a program to make it distributed, and how easy is it to run it?
  2. How much faster can it run in distributed mode?
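To make the first question concrete: Horovod uses synchronous data parallelism. Every worker runs the same script on its own slice of the data, and gradients are averaged across workers before each update. Because of this, the changes to an existing script are small. In the PyTorch API they are calls such as hvd.init(), hvd.DistributedOptimizer and hvd.broadcast_parameters; check the documentation for the exact steps.

The sketch below is a hypothetical, framework-free simulation of that pattern. Plain Python lists stand in for GPUs and tensors, and shard, allreduce_mean, train_single and train_data_parallel are illustrative names, not Horovod APIs. It shows that, with equal-sized shards, averaging per-worker gradients reproduces ordinary single-process gradient descent.

```python
"""Hypothetical simulation of synchronous data-parallel SGD (the pattern Horovod uses).

Plain Python lists stand in for GPUs and tensors; none of these names are Horovod APIs.
"""
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a 1-D linear model y ~ w * x


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def shard(data: List[Sample], rank: int, size: int) -> List[Sample]:
    """Give worker `rank` (of `size` workers) every size-th sample, starting at rank."""
    return data[rank::size]


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an allreduce that averages one gradient value per worker."""
    return sum(values) / len(values)


def train_single(data: List[Sample], w: float, lr: float, steps: int) -> float:
    """Ordinary full-batch gradient descent in one process."""
    for _ in range(steps):
        w -= lr * grad_mse(w, data)
    return w


def train_data_parallel(data: List[Sample], w: float, lr: float, steps: int, size: int) -> float:
    """The same training, split across `size` simulated workers."""
    shards = [shard(data, rank, size) for rank in range(size)]
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (in parallel on real hardware).
        local_grads = [grad_mse(w, s) for s in shards]
        # The averaged gradient is the single update every replica would apply,
        # which keeps all copies of the model identical.
        w -= lr * allreduce_mean(local_grads)
    return w


if __name__ == "__main__":
    data = [(float(i), 3.0 * i) for i in range(8)]  # y = 3x, eight samples
    print(train_single(data, w=0.0, lr=0.01, steps=50))
    print(train_data_parallel(data, w=0.0, lr=0.01, steps=50, size=4))
```

Both runs converge to the same weight, close to 3.0. The equivalence relies on equal-sized shards (8 samples over 4 workers here). With uneven shards, the mean of the per-worker means is no longer the global mean.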

The chart below shows a benchmark run on 128 servers, each with 4 Pascal GPUs, connected by a RoCE-capable 25 Gbit/s network:

[Chart: Horovod benchmark results. Source: https://github.com/horovod/horovod]
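A common way to read a chart like this is scaling efficiency. It is the measured throughput on N GPUs divided by N times the single-GPU throughput, so 1.0 means perfect linear scaling. The helper below is a minimal sketch of that arithmetic. The throughput numbers in the example are hypothetical placeholders, not values taken from the Horovod benchmark.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup reached when training on n_gpus GPUs.

    throughput_n: measured throughput (e.g. images/sec) on n_gpus GPUs.
    throughput_1: measured throughput on a single GPU.
    """
    if throughput_1 <= 0 or n_gpus <= 0:
        raise ValueError("throughput_1 and n_gpus must be positive")
    return throughput_n / (n_gpus * throughput_1)


# The benchmark setup described above: 128 servers with 4 GPUs each.
N_GPUS = 128 * 4  # 512 GPUs

# Hypothetical throughputs, for illustration only.
single_gpu = 100.0  # images/sec on one GPU
cluster = 46_080.0  # images/sec on all 512 GPUs

print(f"{scaling_efficiency(cluster, single_gpu, N_GPUS):.0%}")  # prints "90%"
```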

GitHub: https://github.com/horovod/horovod

Paper: https://arxiv.org/abs/1802.05799

Documentation: https://horovod.readthedocs.io/en/latest/

Installation:

Install the horovod pip package.

To run on CPUs:

$ pip install horovod

To run on GPUs with NCCL:

$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
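The HOROVOD_GPU_ALLREDUCE=NCCL setting selects NCCL for allreduce, the collective operation at the core of Horovod. Every worker contributes a gradient tensor and every worker receives the element-wise sum, which Horovod averages by default. The Horovod paper builds on the ring-allreduce algorithm. In that scheme the workers form a ring, and each one sends and receives roughly 2(n-1)/n times the tensor size regardless of how many workers there are.

The sketch below is a single-process, pure-Python simulation of ring-allreduce for illustration only. It is not how Horovod or NCCL are implemented internally, and ring_allreduce is a hypothetical name.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate ring-allreduce (sum) over n workers, each holding an equal-length vector.

    Each vector is split into n chunks. In the reduce-scatter phase (n-1 steps),
    worker i sends one chunk to worker (i+1) % n, which adds it to its own copy;
    afterwards worker i holds the fully summed chunk (i+1) % n. In the allgather
    phase (n-1 more steps) the summed chunks are passed around the ring so every
    worker ends up with the complete result.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all workers must hold vectors of the same length")
    data = [list(b) for b in buffers]  # copy so the inputs are not modified
    bounds = [c * length // n for c in range(n + 1)]  # chunk c = [bounds[c], bounds[c+1])

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    # Phase 1: reduce-scatter. Snapshot all sends first so they happen "simultaneously".
    for step in range(n - 1):
        sends = [((i - step) % n, [data[i][k] for k in chunk((i - step) % n)]) for i in range(n)]
        for i, (c, values) in enumerate(sends):
            dst = (i + 1) % n
            for k, v in zip(chunk(c), values):
                data[dst][k] += v  # accumulate into the right neighbour's copy

    # Phase 2: allgather. Forward each completed chunk around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, [data[i][k] for k in chunk((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, values) in enumerate(sends):
            dst = (i + 1) % n
            for k, v in zip(chunk(c), values):
                data[dst][k] = v  # overwrite with the fully reduced values

    return data


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # two workers' gradients
    summed = ring_allreduce(grads)
    averaged = [[v / len(grads) for v in row] for row in summed]
    print(averaged)  # [[2.5, 3.5, 4.5], [2.5, 3.5, 4.5]]
```

Splitting the tensor into n chunks is what keeps the per-worker traffic nearly constant as workers are added. Each worker only ever exchanges one chunk with its neighbour per step.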

[Figure: Horovod Benchmarks. Source: https://github.com/horovod/horovod]
