PyTorch DDP works by splitting data across GPUs and synchronizing gradients via all-reduce operations. This introduces communication overhead, and if communication is slow you will typically observe:
- Training speed that no longer scales well with GPU count
- Poor resource utilization, with GPUs sitting idle while they wait for gradient synchronization
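To make that overhead concrete, here is an illustration-only sketch of the gradient averaging DDP performs via all-reduce after each backward pass. This is not DDP's actual implementation, which buckets gradients and overlaps communication with the backward pass, but it shows where the synchronization cost comes from:

import torch.distributed as dist

def average_gradients(model):
    # Every all_reduce below is a synchronization point shared by all ranks,
    # so a slow interconnect stalls every GPU in the job.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # average the summed gradients across ranks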
Assessing the issue
Here’s a simple setup I typically use to assess this issue before launching a training job:
import torch
import torch.distributed as dist
import torch.profiler as profiler
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, gradients synchronized over NCCL
dist.init_process_group("nccl")

model = testmodel().cuda()
model = DDP(model)

# Profile a few epochs and dump a TensorBoard trace to ./log_ddp
with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
    on_trace_ready=profiler.tensorboard_trace_handler('./log_ddp'),
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    for epoch in range(epochs):
        train_one_epoch()
        prof.step()
NCCL Observations
I usually track the following key metrics (a quick way to pull the all-reduce timings out of the profiler output is sketched after the list):
- ncclAllReduce duration per step
- Overlapping compute vs comm
- PCIe/NVLink bandwidth usage
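Here is a minimal sketch for the first metric, assuming prof is the profiler object from the setup above (attribute names such as cuda_time_total can differ slightly between PyTorch versions):

# Summarize the NCCL kernels recorded by the profiler
events = prof.key_averages()
for e in events:
    if "nccl" in e.key.lower():
        print(f"{e.key}: {e.cuda_time_total / 1000:.2f} ms total across {e.count} calls")

# Full breakdown sorted by GPU time; compute/communication overlap and
# interconnect bandwidth are easier to judge in the TensorBoard trace in ./log_ddp
print(events.table(sort_by="cuda_time_total", row_limit=20))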
Optimizations
The optimizations I usually apply are the following (a short sketch of how they fit into a training step follows the list):
- Larger batches = fewer sync points
- PyTorch DDP buckets small gradients to reduce per-call overhead; tune bucket_cap_mb so communication can start earlier.
- Use torch.cuda.amp (mixed precision), which reduces gradient size, so there's less data to sync.
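Here is a minimal sketch of how the bucket size and mixed precision fit into a DDP training step; model, optimizer, criterion, and loader are placeholders, and the 50 MB bucket size is just an example to experiment with (DDP's default is 25 MB):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(model, bucket_cap_mb=50)  # example value; measure before settling on one
scaler = torch.cuda.amp.GradScaler()  # loss scaling for mixed precision

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():   # forward pass and loss in mixed precision
        loss = criterion(model(inputs.cuda()), targets.cuda())
    scaler.scale(loss).backward()     # DDP all-reduces gradients bucket by bucket during backward
    scaler.step(optimizer)
    scaler.update()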
Testbed Results
I used a single node with 8x NVIDIA H100 GPUs connected with NVLink, using the PyTorch NCCL backend.
Before optimization: 3.5x speedup
After NCCL tuning + mixed precision: 7.3x speedup
This reduced GPU idle time by ~58%.
I’ll publish more on this soon, since I was contacted to help debug exactly this issue earlier this week and it seems like a good topic to cover here. Stay tuned!