NCCL Plugin for Multi-Subnet RDMA Triangle Mesh
The NCCL Plugin for Multi-Subnet RDMA Triangle Mesh enables high-performance GPU communication across multiple network subnets using Remote Direct Memory Access (RDMA).
Someone needed to cluster three NVIDIA DGX Sparks in a triangle mesh and found NCCL completely chokes on multi-subnet setups. Each node had different subnet routes to its peers, and NCCL’s networking just assumes everything’s on one subnet.
The fix: writing a custom NCCL network plugin from scratch in ~1500 lines of C.
What it handles:
- Subnet-aware NIC selection (picks the right interface per peer)
- Raw RDMA verbs (QP state machines, memory registration)
- Custom TCP handshake to dodge deadlocks
Result: 8+ GB/s distributed inference across all three nodes over RDMA, on a config NVIDIA definitely doesn’t officially support.
Code’s on GitHub: https://github.com/autoscriptlabs/nccl-mesh-plugin
Pretty niche use case, but it turns out that if you're running unsupported multi-node RDMA topologies, sometimes you just have to implement your own transport layer. The debugging involved plenty of segfaults and RDMA state-machine headaches, but it actually works.