
Distributed Learning: A Primer
by Samuel Flender, Nov 2022



Behind the algorithms that make Machine Learning models bigger, better, and faster

Image generated with Stable Diffusion

Distributed learning is one of the most critical components in the ML stack of modern tech companies: by parallelizing over a large number of machines, one can train bigger models on more data faster, unlocking higher-quality production models with more rapid iteration cycles.

But don’t just take my word for it. Take Twitter’s:

Using customized distributed training […] allows us to iterate faster and train models on more and fresher data.

Or Google’s:

Our experiments show that our new large-scale training methods can use a cluster of machines to train even modestly sized deep networks significantly faster than a GPU, and without the GPU’s limitation on the maximum size of the model.

Or Netflix’s:

We sought out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud. We wanted to use a reasonable number of machines to implement a powerful machine learning solution using a Neural Network approach.

In this post, we’ll explore some of the fundamental design considerations behind distributed learning, with a particular focus on deep neural networks. You’ll learn about:

  • model-parallel vs data-parallel training,
  • synchronous vs asynchronous training,
  • centralized vs de-centralized training, and
  • large-batch training.

Let’s get started.

Model-parallelism vs data-parallelism

There are two main paradigms in distributed training of deep neural networks, model-parallelism, where we distribute the model, and data-parallelism, where we distribute the data.

Model-parallelism means that each machine contains only a partition of the model, for example certain layers of a deep neural network (‘vertical’ partitioning), or certain neurons from the same layer (‘horizontal’ partitioning). Model-parallelism can be useful if a model is too large to fit on a single machine, but it requires large tensors to be sent in between machines, which introduces high communication overhead. In the worst case, one machine may sit idle while waiting for the previous machine to complete its part of the computation.
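
As a minimal illustration of 'vertical' partitioning, here is a hedged PyTorch sketch (the layer sizes and the two-GPU setup are assumptions for illustration, not taken from any particular system):

    import torch
    import torch.nn as nn

    class TwoStageModel(nn.Module):
        def __init__(self):
            super().__init__()
            # The first half of the network lives on one device ...
            self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            # ... and the second half lives on another.
            self.stage2 = nn.Linear(4096, 10).to("cuda:1")

        def forward(self, x):
            # The activation tensor is copied between devices here, which is
            # exactly the communication overhead described above.
            h = self.stage1(x.to("cuda:0"))
            return self.stage2(h.to("cuda:1"))

While stage2 waits for stage1's output, the second GPU sits idle, which is the worst case mentioned above.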

Data-parallelism means that each machine has a complete copy of the model, and runs a forward and backward pass over its local batch of the data (see the sketch after the list below). By definition, this paradigm scales better: we can always add more machines to the cluster, either

  • by keeping the global (cluster-wide) batch size fixed and reducing the local (per-machine) batch size, or
  • by keeping the local batch size fixed and increasing the global batch size.
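
Here is what the data-parallel side typically looks like in PyTorch with DistributedDataParallel, as a minimal sketch assuming the script is launched with torchrun on a single node, so that each rank maps to its own GPU; the random tensors stand in for a real data loader:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Every rank holds a full copy of the model.
    model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024).cuda()              # this rank's local batch
        y = torch.randint(0, 10, (32,)).cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()                               # gradients are averaged across ranks here
        optimizer.step()
        optimizer.zero_grad()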

In practice, model and data parallelism are not exclusive, but complementary: there is nothing stopping us from distributing both the model and the data over our cluster of machines. Such a hybrid approach can have its own advantages, as outlined in a blog post by Twitter.

Lastly, there is also hyperparameter-parallelism, where each machine runs the same model on the same data, but with different hyperparameters. In its most basic form, this scheme is of course embarrassingly parallel.
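
In code, the embarrassingly parallel version is little more than a process pool; the toy objective below is a stand-in for a real train-and-validate run:

    from multiprocessing import Pool

    def run_trial(learning_rate):
        # Stand-in for "train a model with this learning rate and return its
        # validation score"; here we just score a made-up objective.
        return learning_rate, -(learning_rate - 0.01) ** 2

    if __name__ == "__main__":
        learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1]
        with Pool(processes=len(learning_rates)) as pool:
            results = pool.map(run_trial, learning_rates)
        best_lr, _ = max(results, key=lambda r: r[1])
        print(f"best learning rate: {best_lr}")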

Synchronous vs asynchronous training

In data-parallelism, a global batch of data is distributed evenly over all machines in the cluster at each iteration in the training cycle. For example, if we train with a global batch size of 1024 on a cluster with 32 machines, we’d send local batches of 32 to each machine.
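
Expressed as arithmetic, the split from this example (the variable names are just for illustration):

    global_batch_size = 1024
    num_workers = 32
    local_batch_size = global_batch_size // num_workers   # 32 examples per worker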

In order for this to work, we need a parameter server, a dedicated machine that stores and keeps track of the most recent model parameters. Workers send their locally computed gradients to the parameter server, which in turn sends the updated model parameters back to the workers. This can be done either synchronously or asynchronously.

In synchronous training, the parameter server waits for all gradients from all workers to arrive, and then updates the model parameters based on the average gradient, aggregated over all workers. The advantage of this approach is that the average gradient is less noisy, and therefore the parameter update of higher quality, enabling faster model convergence. However, if some workers take much longer to compute their local gradients, then all other workers have to sit idle, waiting for the stragglers to catch up. Idleness, of course, is a waste of compute resources: ideally, every machine should have something to do at all times.

In asynchronous training, the parameter server updates the model parameters as soon as it receives a single gradient from a single worker, and sends the updated parameters immediately back to that worker. This eliminates the problem of idleness, but it introduces another problem, namely staleness. As soon as the model parameters are updated based on the gradient from a single worker, all other workers are now working with stale model parameters. The more workers, the more severe the problem: for example, with 1000 workers, by the time the slowest worker has completed its computation, it will be 999 steps behind.
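
To make the difference concrete, here is a schematic, single-process sketch of the two update rules, assuming gradients arrive as NumPy arrays; a real parameter server would of course receive them over the network:

    import numpy as np

    def synchronous_update(params, worker_gradients, lr):
        # Wait for every worker's gradient, average them, apply one update.
        avg_grad = np.mean(worker_gradients, axis=0)
        return params - lr * avg_grad

    def asynchronous_update(params, single_gradient, lr):
        # Apply a worker's gradient as soon as it arrives; all other workers
        # now hold stale parameters until they fetch the new ones.
        return params - lr * single_gradient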

Asynchronous data-parallelism. Each worker sends their local gradients to the parameter server, and receives the model parameters. (Image source: Langer et al 2020, link)
Synchronous data-parallelism. Before sending the updated model parameters, the parameter server aggregates the gradients from all workers. (Image source: Langer et al 2020, link)

A good rule of thumb may therefore be to use asynchronous training if the number of nodes is relatively small, and switch to synchronous training if the number of nodes is very large. For example, researchers from Google trained their ‘flood-filling network’, a deep neural network for brain image segmentation, on 32 Nvidia K40 GPUs using asynchronous parallelism. However, in order to train the same model architecture on a supercomputer with 2048 compute nodes, researchers from Argonne National Lab (including this author) used synchronous parallelism instead.

In practice, one can also find useful compromises between synchronous and asynchronous parallelism. For example, researchers from Microsoft propose a ‘cruel’ modification to synchronous parallelism: simply leave the slowest workers behind. They report that this modification speeds up training by up to 20% with no impact on the final model accuracy.
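
One way to picture this compromise, as a hedged sketch rather than the paper's actual implementation: in each step, the parameter server waits only for the first k of N gradients and ignores the rest. The queue here stands in for gradients arriving over the network.

    import numpy as np

    def update_dropping_stragglers(params, gradient_queue, k, lr):
        # Average only the first k gradients that arrive; slower workers'
        # gradients for this step are simply discarded.
        first_k = [gradient_queue.get() for _ in range(k)]
        return params - lr * np.mean(first_k, axis=0)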

Centralized vs de-centralized training

The disadvantage of having a central parameter server is that the communication demand for that server grows linearly with the cluster size. This creates a bottleneck, which limits the scale of such a centralized design.

In order to avoid this bottleneck, we can introduce multiple parameter servers, and assign to each parameter server a subset of the model parameters. In the most de-centralized case, each compute node is both a worker (computing gradients) and also a parameter server (storing a subset of the model parameters). The advantage of such a de-centralized design is that the workload and the communication demand for all machines is identical, which eliminates any bottlenecks, and makes it easier to scale.
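
A closely related de-centralized pattern in practice is to drop the parameter server entirely and have all workers average gradients with a collective all-reduce. A minimal PyTorch sketch, assuming the process group is already initialized (e.g. via torchrun):

    import torch.distributed as dist

    def average_gradients(model):
        # Every worker participates equally: sum each gradient tensor across
        # all workers, then divide by the world size to get the mean.
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size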

Large-batch training

In data parallelism, the global batch size grows linearly with the cluster size. In practice, this scaling behavior enables training models with extremely large batch sizes that would be impossible on a single machine because of its memory limitations.

One of the most critical questions in large-batch training is how to adjust the learning rate in relation to the cluster size. For example, if model training works well on a single machine with a batch size of 32 and a learning rate of 0.01, what is the right learning rate when adding 7 more machines, resulting in a global batch size of 256?

In a 2017 paper, researchers from Facebook propose the linear scaling rule: simply scale the learning rate linearly with the batch size (i.e., use 0.08 in the example above). Using a cluster with 256 GPUs and a global batch size of 8192 (32 per GPU), the authors train a deep neural network on the ImageNet dataset in just 60 minutes, a remarkable achievement at the time the paper came out, and a demonstration of the power of large-batch training.
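
The rule itself is just a one-liner; in numbers, for the example above:

    base_batch_size = 32        # batch size that worked on a single machine
    base_learning_rate = 0.01   # learning rate that worked with that batch size
    global_batch_size = 256     # 8 machines x 32 examples each
    scaled_lr = base_learning_rate * global_batch_size / base_batch_size   # 0.08

In the paper, the authors also pair this rule with a gradual learning-rate warmup over the first few epochs to keep early training stable.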

Illustration of large-batch training. Larger batches create less noisy gradients, which enable larger steps and therefore faster convergence. (Image source: McCandlish et al 2018, link)

However, large-batch training has limits. As we’ve seen, in order to benefit from larger batches, we need to increase the learning rate so as to take advantage of the additional information. But if the learning rate is too large, at some point the model may overshoot and fail to converge.

The limits of large-batch training appear to depend on the domain, ranging from batches of tens of thousands for ImageNet to batches of millions for Reinforcement Learning agents learning to play the game Dota 2, as a 2018 paper from OpenAI explains. Finding a theoretical explanation for these limits is an unsolved research question. After all, ML research is largely empirical, and lacks a theoretical backbone.

Conclusion

To recap,

  • distributed learning is a critical component in the ML stack of modern tech companies, enabling training bigger models on more data faster.
  • in data-parallelism, we distribute the data, and in model-parallelism we distribute the model. In practice, both can be used in combination.
  • in synchronous data-parallelism, the parameter server waits for all workers to send their gradients; in asynchronous data-parallelism it does not. Synchronous data-parallelism produces more accurate gradients, at the expense of some idle time while the fastest workers wait for the slowest.
  • in completely de-centralized data-parallelism, each worker is also a parameter server for a subset of the model parameters. De-centralized design equalizes the computation and communication demands over all machines, and therefore eliminates any bottlenecks.
  • data-parallelism enables large-batch training: we can train a model rapidly on a large cluster by scaling the learning rate linearly with the global batch size.

And this is just the tip of the iceberg. Distributed learning remains an active area of research, with open questions such as: what are the limits of large-batch training? How can we optimize a real-world cluster in which multiple training jobs create competing workloads? How do we best handle a cluster containing a mix of compute resources such as CPUs and GPUs? And how can we balance exploration with exploitation when searching over many possible models or hyperparameters?

Welcome to the fascinating world of distributed learning.




