Monitor Distributed Training with NCCL Inspector
NCCL Inspector monitors and troubleshoots distributed deep learning training by analyzing NCCL communication patterns, detecting bottlenecks, and providing per-collective performance metrics. AI engineers deploy NCCL Inspector to monitor distributed training performance with negligible overhead.
Installation:
- Download from https://github.com/nvidia/nccl-inspector
- Set the plugin environment variable: export NCCL_PLUGIN_P2P=nccl-inspector.so
- Enable the Inspector: export NCCL_INSPECTOR_ENABLE=1
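The steps above can be combined into a small launch script. This is a sketch: the plugin filename comes from the installation steps, while the torchrun launch line is an assumed example of how a training job might be started.

```shell
# Enable NCCL Inspector for a training run (plugin name from the steps above).
export NCCL_PLUGIN_P2P=nccl-inspector.so   # load the Inspector plugin
export NCCL_INSPECTOR_ENABLE=1             # turn on metric collection

# Then launch training as usual, e.g. (hypothetical command):
# torchrun --nproc_per_node=8 train.py
echo "NCCL Inspector enabled: $NCCL_INSPECTOR_ENABLE"
```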
Performance Metrics:
- Algorithmic bandwidth: Measures communication efficiency per collective operation
- Bus bandwidth: Tracks actual hardware utilization
- Per-communicator logging: Isolates bottlenecks by rank and operation type
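To make the two bandwidth metrics concrete, the sketch below computes algorithmic bandwidth (data size over collective completion time) and the bus bandwidth it implies for an all-reduce, using the standard 2*(n-1)/n scaling factor from the NCCL performance documentation. The input sizes and timings are illustrative, not Inspector output.

```python
def algorithmic_bw(bytes_moved: int, seconds: float) -> float:
    """Algorithmic bandwidth: payload size / completion time, in GB/s."""
    return bytes_moved / seconds / 1e9

def bus_bw_allreduce(alg_bw: float, n_ranks: int) -> float:
    """Bus bandwidth implied by an all-reduce: algBW * 2*(n-1)/n.

    This factor accounts for each byte crossing the interconnect roughly
    twice (reduce-scatter + all-gather phases) across n ranks.
    """
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

# Example: a 1 GiB all-reduce across 8 GPUs that completes in 10 ms.
alg = algorithmic_bw(1 << 30, 0.010)
bus = bus_bw_allreduce(alg, 8)
print(f"algBW = {alg:.2f} GB/s, busBW = {bus:.2f} GB/s")
```

Comparing the two tells you how close the collective runs to the hardware's wire speed: a large gap between bus bandwidth and the link's rated bandwidth points at a communication bottleneck.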
Output Configuration:
- NCCL_INSPECTOR_OUTPUT_FORMAT=jsonl: Exports structured logs
- NCCL_INSPECTOR_INTERVAL=100: Sets measurement frequency (iterations)
- Convert to Parquet for dashboard integration
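A minimal sketch of consuming the JSONL output with the Python standard library. The record field names here (comm, coll, alg_bw_gbs) are assumptions for illustration; the real Inspector schema may differ.

```python
import json
import io

# Stand-in for an Inspector JSONL log file; field names are hypothetical.
log = io.StringIO(
    '{"comm": "0x1", "coll": "AllReduce", "alg_bw_gbs": 42.0}\n'
    '{"comm": "0x1", "coll": "AllGather", "alg_bw_gbs": 38.5}\n'
)

# JSONL = one JSON object per line, so parse line by line.
records = [json.loads(line) for line in log if line.strip()]

# For dashboard integration, the same records can be written to Parquet,
# e.g. with pandas: pd.DataFrame(records).to_parquet("inspector.parquet")
print(len(records), records[0]["coll"])
```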
Verbose Tracing:
- NCCL_INSPECTOR_VERBOSE=1: Enables kernel-level profiling
This plugin provides always-on, granular observability without impacting training throughput.