
Distributed Parallel Training: Data Parallelism and Model Parallelism

By Luhui Hu | Sep 2022



How to scale out training large models like GPT-3 & DALL-E 2 in PyTorch

Photo by Mark Harpur on Unsplash

Recent years have witnessed exponential growth in the scale of distributed parallel training and the size of deep learning models. In particular, Transformer-based language models have been stealing the show. The famous GPT-3 arrived with 175 billion parameters, 96 attention layers, a 3.2M batch size, and 499 billion tokens of training data. Exactly half a year later, Google published Switch Transformer with 1.6 trillion parameters. On the same day (1/11/2021), the Beijing Academy of Artificial Intelligence (BAAI) released the initial Wu Dao 1.0. Before long, Wu Dao 2.0 debuted on 5/31/2021 as the largest language model, with 1.75 trillion parameters, roughly ten times as many as GPT-3.

If we were to train GPT-3 on 240 ml.p4d.24xlarge instances of the Amazon SageMaker training platform, the whole model would take about 25 days to train. The challenge is not just processing but also memory: Wu Dao 2.0 appears to need more than 1,000 GPUs just to store its parameters.

It is imperative to employ distributed parallel training for large deep learning models like GPT-3 and DALL-E 2. There are two primary types of distributed parallel training: data parallelism and model parallelism. We further divide the latter into two subtypes: pipeline parallelism and tensor parallelism. We will cover both types of distributed parallel training here and demonstrate how to develop them in PyTorch.

Understanding Distributed Parallel Training

Distributed parallel training has two high-level concepts: parallelism and distribution.

Parallelism is a framework strategy to tackle the size of large models or improve training efficiency, and distribution is an infrastructure architecture to scale out.

In addition to the two basic types of parallelism, there are many more variants, such as expert parallelism. Furthermore, two or more of them can be combined, such as mixed data and model parallelism. It’s common to mix model and data parallelism for large-scale models. For instance, the largest T5 models and GPT-3 employ combined model and data parallelism. However, all of these should be part of the strategy of the DL modeling framework.

On the other hand, distribution eventually scales the parallelism out in the cloud or a cluster. Containerization makes it easy to scale nodes, and Kubernetes or cloud solutions can orchestrate them effectively. Each node can have multiple GPUs (or TPUs and other devices) and multiple containers in a container cluster. In a cloud-native solution, the nodes can be hidden from users. A container manages one or more GPUs, and the parallelism can be dispatched across a cluster of distributed GPU containers. So distribution is the implementation of the infrastructure architecture.

Data and weight partitioning strategies of Google Switch Transformers (source: Fedus et al., 2021)

The above illustrates data and weight partitioning strategies in Google Switch Transformers. Each 4×4 dotted-line grid represents 16 cores, and the shaded squares are the data on each core (either model weights or a batch of tokens). It demonstrates how the model weights and the data tensors are split under each strategy. The first row illustrates how model weights are divided across the cores. Shapes of different sizes in this row represent larger weight matrices in the Feed-Forward Network (FFN) layers (e.g., larger d_ff sizes). Each color of the shaded squares identifies a unique weight matrix. The number of parameters per core is fixed, but larger weight matrices apply more computation to each token. The second row demonstrates how the data batch is split across cores. Each core holds the same number of tokens, maintaining a fixed memory usage across all strategies. The strategies differ in whether the cores hold the same or different tokens, indicated by the token colors.

Data Parallelism in PyTorch

Data parallelism shards data across all cores while using the same model on each. A data parallelism framework such as PyTorch Distributed Data Parallel, SageMaker Distributed, or Horovod mainly accomplishes the following three tasks:

  1. First, it creates and dispatches copies of the model, one copy per accelerator.
  2. It shards the data and then distributes it to the corresponding devices.
  3. It finally aggregates all results together in the backpropagation step.

So we can see that the first task happens once per training run, while the last two occur in each iteration.

PyTorch Distributed Data Parallel (DDP) implements data parallelism at the module level for running across multiple machines. It can work together with PyTorch model parallelism. DDP applications should spawn multiple processes and create a DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers. Furthermore, DDP registers an autograd hook for each parameter from model.parameters(), which fires when the corresponding gradient is computed in the backward pass. DDP then uses that signal to trigger gradient synchronization across processes.

So there are three main steps to set up and run DDP in PyTorch:

  1. Set up distributed system via torch.distributed.
  2. Define the DDP model via torch.nn.parallel.
  3. Spawn to run through torch.multiprocessing.

Please see the example code below.

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

### Step 1: setup and cleanup
def setup(rank, world_size):
    # common single-node defaults for the process-group rendezvous
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # initialize the process group ("nccl" for GPUs, "gloo" for CPU-only)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

### Step 2: define DDP modeling
def dummy_init(rank, world_size):
    setup(rank, world_size)
    model = DummyModel().to(rank)              # DummyModel: any nn.Module
    ddp_model = DDP(model, device_ids=[rank])
    ...                                        # training steps go here
    cleanup()

### Step 3: spawn to run
def run_dummy(dummy_fn, world_size):
    mp.spawn(dummy_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
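
To show how the three data-parallel tasks above map to code, here is a minimal sketch of a worker function built on the setup, cleanup, and launch helpers from the block above. DummyModel, the random dataset, and the SGD hyperparameters are placeholders for illustration, not part of the original example.

from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

class DummyModel(nn.Module):
    """A tiny stand-in model; any nn.Module works here."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)

    def forward(self, x):
        return self.net(x)

def train_worker(rank, world_size):
    setup(rank, world_size)

    # Task 1: replicate the model on this rank's GPU and wrap it with DDP
    ddp_model = DDP(DummyModel().to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    # Task 2: shard the data -- DistributedSampler gives each rank its own subset
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)            # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()                 # Task 3: gradients are all-reduced here
            optimizer.step()

    cleanup()

# Launch one process per GPU:
# run_dummy(train_worker, world_size=torch.cuda.device_count())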

Model Parallelism in PyTorch

Model parallelism shards a model (i.e., its layers or tensors) across multiple cores, unlike data parallelism, which replicates the same model on all training cores. PyTorch eases the parallel implementation, letting you wrap a model with minimal code changes.

In a nutshell, you need to assign neural network layers and intermediate outputs to the desired cores via “to(device)” in three corresponding areas: the model definition, the “forward” method, and the “backward” step when calling the loss function. PyTorch handles all the rest behind the scenes. Please see the example code below.
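
The following is a minimal sketch of those three areas, assuming a machine with two visible GPUs (cuda:0 and cuda:1); ToyModelParallel and its layer sizes are made-up placeholders. Each half of the network lives on its own device, the intermediate activation is moved between devices in forward, and the labels are moved to the output’s device before computing the loss so backpropagation flows across both GPUs.

import torch
import torch.nn as nn
import torch.optim as optim

class ToyModelParallel(nn.Module):
    """A toy model split across two GPUs (assumes cuda:0 and cuda:1 exist)."""
    def __init__(self):
        super().__init__()
        # Model definition: place each half on its own device
        self.part1 = nn.Linear(10, 10).to('cuda:0')
        self.relu = nn.ReLU()
        self.part2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        # Forward: move the intermediate output to the next device
        x = self.relu(self.part1(x.to('cuda:0')))
        return self.part2(x.to('cuda:1'))

model = ToyModelParallel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
# Backward: labels must be on the same device as the outputs (cuda:1)
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()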

This may not be straightforward in the real world of large model parallelism. It often needs extra effort to improve training efficiency and resource utilization. Taking pipeline parallelism as an example, PipeDream improves pipeline efficiency by sacrificing memory to store multiple copies of weights. TeraPipe introduces another form of pipelining specific to single-Transformer architectures, where pipelining occurs across tokens rather than micro-batches. Also, Mesh-TensorFlow and Megatron-LM provide tensor parallelism frameworks for efficiently training billion-parameter models on TensorFlow and PyTorch, respectively.
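
To make the micro-batch idea concrete, here is a much-simplified, GPipe-style sketch, not PipeDream’s or TeraPipe’s actual scheme and not a real library API. The input batch is split into micro-batches that flow through two pipeline stages on two assumed GPUs (cuda:0 and cuda:1), so one stage can process micro-batch i+1 while the other processes micro-batch i; the micro-batch size and layer shapes are illustrative placeholders.

import torch
import torch.nn as nn

class ToyPipelineParallel(nn.Module):
    """GPipe-style micro-batch pipelining over two assumed GPUs."""
    def __init__(self, micro_batch_size=8):
        super().__init__()
        self.micro_batch_size = micro_batch_size
        self.stage1 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to('cuda:0')
        self.stage2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        micro_batches = iter(x.split(self.micro_batch_size, dim=0))
        # Prime the pipeline with the first micro-batch
        prev = self.stage1(next(micro_batches).to('cuda:0')).to('cuda:1')
        outputs = []
        for mb in micro_batches:
            # stage2 works on the previous micro-batch on cuda:1 ...
            outputs.append(self.stage2(prev))
            # ... while stage1 starts the next one on cuda:0 (CUDA kernel
            # launches are asynchronous, so the two stages overlap in time)
            prev = self.stage1(mb.to('cuda:0')).to('cuda:1')
        outputs.append(self.stage2(prev))
        return torch.cat(outputs)

# out = ToyPipelineParallel()(torch.randn(32, 10))   # output lives on cuda:1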

Amazon SageMaker model parallelism is a software library on top of PyTorch. It is a general and flexible framework supporting pipeline and tensor parallelism with memory-saving features. Its pipeline parallelism engine enables load-balanced auto-partitioning and a pipelining runtime for arbitrary model architectures, based on a module-server design. As with pipeline parallelism, the fundamental computational unit for tensor parallelism is nn.Module. In essence, tensor parallelism consists of traversing the model and replacing specific submodules with their distributed implementations.
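
As a rough illustration of that idea, here is a hypothetical, simplified Megatron-style column split of a linear layer, not the SageMaker library’s actual API: an nn.Linear submodule is replaced by two half-width linears on two assumed GPUs whose outputs are concatenated, so each device stores and computes only part of the weight matrix.

import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Hypothetical tensor-parallel replacement for nn.Linear: the weight
    matrix is split column-wise across two assumed GPUs (cuda:0, cuda:1)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        assert out_features % 2 == 0
        half = out_features // 2
        self.shard0 = nn.Linear(in_features, half).to('cuda:0')
        self.shard1 = nn.Linear(in_features, half).to('cuda:1')

    def forward(self, x):
        # Each GPU computes its slice of the output features
        y0 = self.shard0(x.to('cuda:0'))
        y1 = self.shard1(x.to('cuda:1'))
        # Gather the partial results on one device and concatenate
        return torch.cat([y0, y1.to('cuda:0')], dim=-1)

# Replacing a submodule with its distributed implementation, e.g.:
# model.fc = ColumnParallelLinear(model.fc.in_features, model.fc.out_features)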

Takeaways

Distributed parallel training involves two high-level concepts: parallelism and distribution. Parallelism is a framework strategy, and distribution is an infrastructure architecture. Distributed parallel training is critical but still nascent in industry and research. We can look forward to progress in three areas:

  1. Parallelism shards data, models, or both to make large-model training work. As data and models grow exponentially, optimizing memory usage and processing efficiency becomes vital.
  2. It’s expensive to train a large model. Transfer learning that reuses trained layers could change the game of large-scale distributed parallel training.
  3. The ML lifecycle involves multiple distributed systems, from data collection to processing, model training, and serving. ML platforms are often hindered by complexity, data communication costs, and system instability. Unifying all distributed systems for ML will be significant.

