NCCL Plugin for Multi-Subnet RDMA Triangle Mesh

The NCCL Plugin for Multi-Subnet RDMA Triangle Mesh enables high-performance GPU communication across multiple network subnets using Remote Direct Memory Access (RDMA).

Someone needed to cluster three NVIDIA DGX Sparks in a triangle mesh and found NCCL completely chokes on multi-subnet setups. Each node had different subnet routes to its peers, and NCCL’s networking just assumes everything’s on one subnet.

The fix: a custom NCCL network plugin, written from scratch in ~1500 lines of C.

What it handles:

  • Subnet-aware NIC selection (picks the right interface per peer)
  • Raw RDMA verbs (QP state machines, memory registration)
  • Custom TCP handshake to dodge deadlocks

Result: 8+ GB/s distributed inference across all three nodes over RDMA, on a config NVIDIA definitely doesn’t officially support.

Code’s on GitHub: https://github.com/autoscriptlabs/nccl-mesh-plugin

Pretty niche use case, but turns out if you’re running unsupported multi-node RDMA topologies, sometimes you just have to implement your own transport layer. The debugging involved lots of segfaults and RDMA state machine headaches, but it actually works.