PyTorch DDP works by splitting data across GPUs and synchronizing gradients via all-reduce operations. This introduces communication overhead, and if communication is slow you will observe:
- Training speed that no longer scales well with GPU count
- Low GPU utilization, with devices sitting idle while they wait for gradient synchronization (a quick standalone all-reduce check is sketched right after this list)
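Before profiling a full training job, it can be worth timing a raw all-reduce on its own to see what the interconnect actually delivers. The sketch below is mine, not part of the original setup: it assumes a launch via torchrun (one process per GPU), and the 256 MB tensor size and iteration counts are arbitrary illustration values.

```python
# all_reduce_check.py -- run with: torchrun --nproc_per_node=<num_gpus> all_reduce_check.py
# Minimal sketch; assumes torchrun has set RANK / LOCAL_RANK / WORLD_SIZE.
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 256 MB of fp32 "gradients" -- an arbitrary size for illustration.
x = torch.randn(64 * 1024 * 1024, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

if dist.get_rank() == 0:
    size_gb = x.numel() * x.element_size() / 1e9
    world = dist.get_world_size()
    # Ring all-reduce moves ~2*(N-1)/N of the data per GPU, hence the "bus bandwidth" factor.
    busbw = size_gb * 2 * (world - 1) / world / elapsed
    print(f"all_reduce of {size_gb:.2f} GB: {elapsed * 1000:.1f} ms per call, ~{busbw:.0f} GB/s bus bandwidth")

dist.destroy_process_group()
```

If this number is far below what NVLink or PCIe should deliver on the node, the problem is in the fabric or topology rather than in DDP itself.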
Assessing the issue
Here’s a simple setup I typically use to assess this issue before launching a training job:
```python
import torch
import torch.distributed as dist
import torch.profiler as profiler
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")

# testmodel(), train_one_epoch(), and epochs are placeholders for your own model and training loop.
model = testmodel().cuda()
model = DDP(model)

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
    on_trace_ready=profiler.tensorboard_trace_handler('./log_ddp'),
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    with_flops=True,
) as prof:
    for epoch in range(epochs):
        train_one_epoch()
        prof.step()  # mark a step boundary so the trace is segmented per epoch
```

NCCL Observations
I usually track the following key metrics (a sketch for pulling the first two out of the profile follows the list):
- ncclAllReduce duration per step
- Overlap of compute and communication (how much of the all-reduce is hidden behind the backward pass)
- PCIe/NVLink bandwidth usage
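The first two can be read straight out of the profile captured above. This is a rough sketch, assuming the prof object from the earlier snippet; matching the substring "nccl" in the op names is a heuristic for catching the NCCL kernels, not an official API, and the attribute/sort names (cuda_time_total vs. self_device_time_total) vary slightly across PyTorch versions.

```python
# After the profiling context above has exited, summarize where the time went.
# CUDA time per op, sorted so the NCCL collectives and the big GEMMs surface first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))

# Rough total time spent in NCCL collectives (heuristic: "nccl" appears in the op name).
nccl_us = sum(
    evt.cuda_time_total
    for evt in prof.key_averages()
    if "nccl" in evt.key.lower()
)
print(f"Total NCCL kernel time: {nccl_us / 1000:.1f} ms")
```

Overlap and PCIe/NVLink bandwidth are easier to judge visually in the trace written by tensorboard_trace_handler, where the NCCL kernels sit on their own CUDA stream next to the backward-pass kernels.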
Optimizations
The optimizations I usually apply are the following:
- Larger batches = fewer sync points
- PyTorch DDP buckets small gradients to reduce overhead; tune `bucket_cap_mb` to allow communication to start earlier.
- Use `torch.cuda.amp` (mixed precision), which shrinks the gradients, so there is less data to sync per step (both knobs are sketched after this list).
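Here’s a rough sketch of what those two changes can look like together. It is illustrative rather than the exact code behind the numbers below: the bucket_cap_mb value of 50 is just a starting point to sweep (DDP's default is 25 MB), testmodel and train_loader are placeholders, and the fp16 gradient-compression comm hook is my addition (with plain autocast the parameter gradients that DDP all-reduces stay fp32, so the hook is what actually shrinks the communicated buckets).

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as comm_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")

model = testmodel().cuda()  # placeholder model, as in the profiling snippet

# Bigger buckets = fewer, larger all-reduces; smaller buckets = earlier overlap with backward.
# 50 MB is an example value to sweep -- the default is 25 MB.
model = DDP(model, bucket_cap_mb=50)

# Optional: compress gradient buckets to fp16 on the wire, halving the data to sync.
model.register_comm_hook(state=None, hook=comm_hooks.fp16_compress_hook)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients from underflowing

# Placeholder dataloader; raising its batch_size is the "larger batches = fewer sync points" knob.
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    # Mixed-precision forward/backward.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()  # DDP overlaps the bucket all-reduces with this backward pass
    scaler.step(optimizer)
    scaler.update()
```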
Testbed Results
I used a single node with 8x NVIDIA H100 GPUs connected with NVLink, using PyTorch's NCCL backend.
- Before optimization: 3.5x speedup
- After NCCL tuning + mixed precision: 7.3x speedup (a generic starting point for the NCCL tuning is sketched below)

This reduced idle time by ~58%.
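The post doesn't spell out which NCCL settings were changed, so don't read the following as the exact tuning behind the 7.3x number. It's a generic starting point I'd use: turn on NCCL's own logging to see which algorithms and transports it picks, then experiment with a couple of well-known environment variables. They have to be set before init_process_group (or exported in the launcher's environment).

```python
import os

# NCCL reads these from the environment at initialization time.
os.environ.setdefault("NCCL_DEBUG", "INFO")                # log chosen rings, algorithms, transports
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")   # keep the log readable

# Knobs people commonly sweep -- values here are illustrative, not a recommendation:
# os.environ["NCCL_ALGO"] = "Ring"         # or "Tree": force a collective algorithm
# os.environ["NCCL_MIN_NCHANNELS"] = "4"   # more channels can help saturate NVLink

import torch.distributed as dist
dist.init_process_group("nccl")
```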
I’ll publish more on this soon, since I was contacted to help debug this exact issue earlier this week. It might be a good topic to cover here. Stay tuned!