
Distributed Parallel Training — Model Parallel Training | by Luhui Hu | Sep, 2022



Distributed model parallel training for large models in PyTorch

Photo by Daniela Cuevas on Unsplash

Recent years have seen an exponential increase in the scale of deep learning models and in the challenge of distributed parallel training. For example, the famous GPT-3 has 175 billion parameters and 96 attention layers, with a 3.2M batch size and roughly 499 billion tokens of training data. The Amazon SageMaker training platform can achieve a throughput of 32 samples per second for a 175-billion-parameter model on 120 ml.p4d.24xlarge instances. If we scale up to 240 instances, the full model can be trained in about 25 days.

Training parallelism on GPUs becomes necessary for large models. There are three typical types of distributed parallel training: distributed data parallelism, model parallelism, and tensor parallelism. We often group the latter two into one category, model parallelism, and then divide it into two subtypes: pipeline parallelism and tensor parallelism. We will focus on distributed model parallel training here and demonstrate how to develop it in PyTorch.
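To make the distinction concrete, here is a minimal sketch of the tensor-parallel idea: the weight matrix of a single linear layer is split column-wise, so each GPU holds and computes only half of the output features. The class name ColumnParallelLinear and the device assignments are illustrative assumptions, not an API from this article or from PyTorch:

import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    # Illustrative only: one linear layer whose output features are split
    # in half, with each half computed on a different GPU.
    def __init__(self, in_features, out_features):
        super().__init__()
        assert out_features % 2 == 0
        half = out_features // 2
        # Each shard holds half of the full weight matrix
        self.shard0 = nn.Linear(in_features, half).to('cuda:0')
        self.shard1 = nn.Linear(in_features, half).to('cuda:1')

    def forward(self, x):
        # Both shards see the full input and compute half of the output features
        y0 = self.shard0(x.to('cuda:0'))
        y1 = self.shard1(x.to('cuda:1'))
        # Gather the partial outputs back onto one device
        return torch.cat([y0, y1.to('cuda:0')], dim=-1)

In contrast, pipeline-style model parallelism keeps each layer whole and places different layers on different devices, which is the approach the rest of this article demonstrates.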

Understanding Distributed Model Parallel Training

Model parallelism shards a model across multiple GPUs, unlike data parallelism, which replicates the same model on every training GPU. There are three critical points for model parallel training: 1. how to shard the layers of a model effectively; 2. how to train the sharded layers in parallel rather than sequentially; 3. how to achieve high throughput between nodes. If we cannot balance the shards or run them in parallel, we cannot achieve the goal of model parallelism.

Distributed training is training parallelism across multiple nodes in a cluster or resource pool. Containerization makes it easy to scale nodes, and Kubernetes orchestrates them well. Each node can have multiple GPUs and multiple containers, and a container can control one or more GPUs. Model parallelism can dispatch the layers of a model across such a cluster of distributed GPU nodes, which scales out the training effectively.

We can illustrate distributed model parallel training below.

Model Parallelism explained (by author)

In the above example, there are two nodes in the cluster. Each node has one or two containers, and each container has one or two GPUs. The seven layers of a model are distributed across these GPUs.

Each GPU and each node has resource limits. Model parallelism enables the training itself to be parallelized, while distributed training allows it to scale out.

Model Parallelism in PyTorch

The description above shows that distributed model parallel training has two main parts: model parallelism and distributed training. Designing how a model is parallelized across multiple GPUs is essential to realize this. PyTorch wraps this up and alleviates much of the implementation work; only three small changes are needed.

  1. Use “to(device)” to place particular layers (or sub-networks) of a model on a specific device (or GPU).
  2. Add a “forward” method that moves intermediate outputs across devices accordingly.
  3. Place the labels on the same device as the outputs when calling the loss function. “backward()” and “torch.optim” will then handle gradients automatically, as if running on a single GPU.

Let’s write some dummy code to illustrate it. Assume we run a simple two-layer model on two GPUs: put each linear layer on its own GPU and move the inputs and intermediate outputs to the corresponding GPU. We can define a dummy model as follows:

import torch
import torch.nn as nn

class DummyModel(nn.Module):
    def __init__(self):
        super(DummyModel, self).__init__()
        # First linear layer lives on GPU 0
        self.net0 = nn.Linear(20, 10).to('cuda:0')
        self.relu = nn.ReLU()
        # Second linear layer lives on GPU 1
        self.net1 = nn.Linear(10, 10).to('cuda:1')

    def forward(self, x):
        # Move the input to GPU 0 for the first layer
        x = self.relu(self.net0(x.to('cuda:0')))
        # Move the intermediate output to GPU 1 for the second layer
        return self.net1(x.to('cuda:1'))

Now we can add training code with a loss function below:

import torch
import torch.nn as nn
import torch.optim as optim

# DummyModel is defined above
model = DummyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
# The input has 20 features to match net0; the outputs come back on cuda:1
outputs = model(torch.randn(30, 20))
# Labels must be on the same device as the outputs
labels = torch.randn(30, 10).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()

The same idea can be extended quickly to more complicated models. We can group multiple layers in a “Sequential” and allocate them to a specific GPU via “to(device)”, as sketched below. Of course, there are more improvements for optimizing parallel efficiency, but we won’t cover them here.
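As a rough sketch of that idea (the layer sizes, stage names, and device assignments here are illustrative assumptions, not part of the original example), the layers might be grouped like this:

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0: a group of layers wrapped in Sequential and placed on GPU 0
        self.stage0 = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
        ).to('cuda:0')
        # Stage 1: the remaining layers placed on GPU 1
        self.stage1 = nn.Sequential(
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 10),
        ).to('cuda:1')

    def forward(self, x):
        # Run the first group on GPU 0, then hand the intermediate
        # activation over to GPU 1 for the second group
        x = self.stage0(x.to('cuda:0'))
        return self.stage1(x.to('cuda:1'))

Training such a model works exactly like the earlier example; only the device of the labels changes, since the loss must be computed on the device where the final stage produces its outputs (cuda:1 here).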

TL;DR

Distributed model-parallel training has two primary concepts. Model parallelism makes it possible to train large models that cannot fit on a single GPU or device, and distributed training scales out effectively by sharding a model across distributed devices. PyTorch and other libraries (like SageMaker) make life easier with minimal code changes, even though the underlying implementation is sophisticated.

