coding

Monitor Distributed Training with NCCL Inspector

NCCL Inspector monitors and troubleshoots distributed deep learning training by analyzing NCCL communication patterns, detecting bottlenecks, and providing

AI engineers deploy NCCL Inspector to monitor distributed training performance without overhead.

Installation:

  • Download from https://github.com/nvidia/nccl-inspector
  • Set environment variable: export NCCL_PLUGIN_P2P=nccl-inspector.so
  • Enable with: export NCCL_INSPECTOR_ENABLE=1

Performance Metrics:

  • Algorithmic bandwidth: Measures communication efficiency per collective operation
  • Bus bandwidth: Tracks actual hardware utilization
  • Per-communicator logging: Isolates bottlenecks by rank and operation type

Output Configuration:

  • NCCL_INSPECTOR_OUTPUT_FORMAT=jsonl: Exports structured logs
  • NCCL_INSPECTOR_INTERVAL=100: Sets measurement frequency (iterations)
  • Convert to Parquet for dashboard integration

Verbose Tracing:

  • NCCL_INSPECTOR_VERBOSE=1: Enables kernel-level profiling

This plugin provides always-on, granular observability without impacting training throughput.