PyTorch DDP works by splitting data across GPUs and synchronizing gradients via all-reduce operations. This introduces communication overhead, and if communication is slow you will observe:
- Training speed that no longer scales well with GPU count
- Low GPU utilization, with devices sitting idle while they wait for gradient synchronization (a quick standalone all-reduce check is sketched right after this list)
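Before profiling a full training job, it can be worth timing a raw all-reduce on its own to see what the interconnect actually delivers. The sketch below is mine, not part of the original setup: it assumes a launch via torchrun (one process per GPU), and the 256 MB tensor size and iteration counts are arbitrary illustration values.

```python
# all_reduce_check.py -- run with: torchrun --nproc_per_node=<num_gpus> all_reduce_check.py
# Minimal sketch; assumes torchrun has set RANK / LOCAL_RANK / WORLD_SIZE.
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 256 MB of fp32 "gradients" -- an arbitrary size for illustration.
x = torch.randn(64 * 1024 * 1024, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

if dist.get_rank() == 0:
    size_gb = x.numel() * x.element_size() / 1e9
    world = dist.get_world_size()
    # Ring all-reduce moves ~2*(N-1)/N of the data per GPU, hence the "bus bandwidth" factor.
    busbw = size_gb * 2 * (world - 1) / world / elapsed
    print(f"all_reduce of {size_gb:.2f} GB: {elapsed * 1000:.1f} ms per call, ~{busbw:.0f} GB/s bus bandwidth")

dist.destroy_process_group()
```

If this number is far below what NVLink or PCIe should deliver on the node, the problem is in the fabric or topology rather than in DDP itself.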
Assessing the issue
Here’s a simple setup I typically use to assess this issue before launching a training job:
```python
import torch
import torch.distributed as dist
import torch.profiler as profiler
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")

# testmodel(), train_one_epoch(), and epochs are placeholders for your own model and training loop.
model = testmodel().cuda()
model = DDP(model)

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
    on_trace_ready=profiler.tensorboard_trace_handler('./log_ddp'),
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    with_flops=True,
) as prof:
    for epoch in range(epochs):
        train_one_epoch()
        prof.step()  # mark a step boundary so the trace is segmented per epoch
```

NCCL Observations
I usually track the following key metrics (a sketch for pulling the first two out of the profile follows the list):
- ncclAllReduce duration per step
- Overlap of compute and communication (how much of the all-reduce is hidden behind the backward pass)
- PCIe/NVLink bandwidth usage
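The first two can be read straight out of the profile captured above. This is a rough sketch, assuming the prof object from the earlier snippet; matching the substring "nccl" in the op names is a heuristic for catching the NCCL kernels, not an official API, and the attribute/sort names (cuda_time_total vs. self_device_time_total) vary slightly across PyTorch versions.

```python
# After the profiling context above has exited, summarize where the time went.
# CUDA time per op, sorted so the NCCL collectives and the big GEMMs surface first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))

# Rough total time spent in NCCL collectives (heuristic: "nccl" appears in the op name).
nccl_us = sum(
    evt.cuda_time_total
    for evt in prof.key_averages()
    if "nccl" in evt.key.lower()
)
print(f"Total NCCL kernel time: {nccl_us / 1000:.1f} ms")
```

Overlap and PCIe/NVLink bandwidth are easier to judge visually in the trace written by tensorboard_trace_handler, where the NCCL kernels sit on their own CUDA stream next to the backward-pass kernels.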
Optimizations
The optimizations I usually apply are the following:
- Larger batches = fewer sync points
- PyTorch DDP buckets small gradients to reduce overhead; tune `bucket_cap_mb` to allow communication to start earlier.
- Use `torch.cuda.amp` (mixed precision), which shrinks the gradients, so there is less data to sync per step (both knobs are sketched after this list).
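Here’s a rough sketch of what those two changes can look like together. It is illustrative rather than the exact code behind the numbers below: the bucket_cap_mb value of 50 is just a starting point to sweep (DDP's default is 25 MB), testmodel and train_loader are placeholders, and the fp16 gradient-compression comm hook is my addition (with plain autocast the parameter gradients that DDP all-reduces stay fp32, so the hook is what actually shrinks the communicated buckets).

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as comm_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")

model = testmodel().cuda()  # placeholder model, as in the profiling snippet

# Bigger buckets = fewer, larger all-reduces; smaller buckets = earlier overlap with backward.
# 50 MB is an example value to sweep -- the default is 25 MB.
model = DDP(model, bucket_cap_mb=50)

# Optional: compress gradient buckets to fp16 on the wire, halving the data to sync.
model.register_comm_hook(state=None, hook=comm_hooks.fp16_compress_hook)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients from underflowing

# Placeholder dataloader; raising its batch_size is the "larger batches = fewer sync points" knob.
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    # Mixed-precision forward/backward.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()  # DDP overlaps the bucket all-reduces with this backward pass
    scaler.step(optimizer)
    scaler.update()
```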
Testbed Results
I used a single node with 8x NVIDIA H100 GPUs connected with NVLink, using PyTorch's NCCL backend.
- Before optimization: 3.5x speedup
- After NCCL tuning + mixed precision: 7.3x speedup (a generic starting point for the NCCL tuning is sketched below)

This reduced idle time by ~58%.
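The post doesn't spell out which NCCL settings were changed, so don't read the following as the exact tuning behind the 7.3x number. It's a generic starting point I'd use: turn on NCCL's own logging to see which algorithms and transports it picks, then experiment with a couple of well-known environment variables. They have to be set before init_process_group (or exported in the launcher's environment).

```python
import os

# NCCL reads these from the environment at initialization time.
os.environ.setdefault("NCCL_DEBUG", "INFO")                # log chosen rings, algorithms, transports
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")   # keep the log readable

# Knobs people commonly sweep -- values here are illustrative, not a recommendation:
# os.environ["NCCL_ALGO"] = "Ring"         # or "Tree": force a collective algorithm
# os.environ["NCCL_MIN_NCHANNELS"] = "4"   # more channels can help saturate NVLink

import torch.distributed as dist
dist.init_process_group("nccl")
```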
I’ll publish more on this soon, since I was contacted to help debug this exact issue earlier this week. It might be a good topic to cover here. Stay tuned!