ik_llama.cpp Enables True Parallel Multi-GPU Inference

ik_llama.cpp is a fork of llama.cpp that enables true parallel processing across multiple GPUs rather than just pooling VRAM, using split mode graph execution

What It Is

ik_llama.cpp represents a significant fork of the popular llama.cpp inference engine, introducing genuine parallel processing across multiple GPUs. Traditional multi-GPU configurations in llama.cpp primarily served to pool VRAM, allowing larger models to fit in memory by spreading layers across cards. The actual computation, however, remained largely sequential.

This fork implements a “split mode graph” execution strategy that fundamentally changes how inference workloads distribute across available GPUs. Rather than treating additional cards as overflow storage, the system actively parallelizes computation across all devices simultaneously. Early benchmarks show 3x-4x performance improvements on dual and triple GPU configurations, transforming what were essentially expensive VRAM expansions into legitimate performance multipliers.

The implementation focuses on optimizing the computational graph to identify parallelizable operations and distribute them intelligently. This approach differs from simple layer splitting by analyzing dependencies and scheduling work to minimize idle time across GPUs.
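A toy list scheduler makes the idea concrete. Everything below (the op graph, the unit costs, and the greedy device assignment) is an invented illustration of dependency-aware scheduling, not ik_llama.cpp's actual implementation:

```python
# Toy illustration of dependency-aware scheduling across GPUs.
# The op graph, costs, and greedy assignment are invented for
# illustration; this is not ik_llama.cpp's real scheduler.

def schedule(ops, deps, cost, n_devices):
    """Greedy list scheduling: each op starts once its dependencies
    finish, on whichever device frees up first. Returns the makespan."""
    finish = {}                      # op -> finish time
    device_free = [0.0] * n_devices  # next free time per device
    for op in ops:                   # ops arrive in topological order
        ready = max((finish[d] for d in deps.get(op, [])), default=0.0)
        dev = min(range(n_devices), key=lambda i: device_free[i])
        start = max(ready, device_free[dev])
        finish[op] = start + cost[op]
        device_free[dev] = finish[op]
    return max(finish.values())

# Two independent branches (think parallel attention paths) joined at the end.
ops  = ["a1", "a2", "b1", "b2", "join"]
deps = {"a2": ["a1"], "b2": ["b1"], "join": ["a2", "b2"]}
cost = {op: 1.0 for op in ops}

sequential = schedule(ops, deps, cost, n_devices=1)  # 5.0
parallel   = schedule(ops, deps, cost, n_devices=2)  # 4.0: branches overlap
```

Independent branches overlap on separate devices, while the join waits for both, which is exactly why analyzing dependencies (rather than splitting blindly) determines how much idle time can be removed.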

Why It Matters

The economics of local LLM deployment shift dramatically with effective multi-GPU scaling. A single NVIDIA RTX 4090 or professional-grade A6000 represents a $1,500-$5,000 investment. Two or three mid-range cards like RTX 4070 Ti units at $800 each can now deliver superior throughput while maintaining flexibility for other workloads.

Homelab enthusiasts and small research teams gain the most immediate benefit. Running quantized 70B parameter models becomes practical on consumer hardware, opening capabilities previously reserved for cloud services or enterprise budgets. The ability to scale horizontally with commodity GPUs also simplifies incremental upgrades: adding a third card to boost performance beats replacing an entire system.

Cloud deployment strategies also evolve. Providers offering GPU instances can optimize costs by allocating multiple smaller instances rather than reserving high-end hardware. Developers testing inference performance can spin up multi-GPU configurations temporarily without committing to expensive single-card solutions.

The timing proves particularly relevant given current GPU market conditions. Supply constraints and AI demand have inflated prices across the board, making efficient use of available hardware critical for anyone running models locally.

Getting Started

The fork lives at https://github.com/ikawrakow/ik_llama.cpp and follows standard llama.cpp build procedures with additional compilation flags for multi-GPU support. Developers familiar with the original project will recognize the structure.

Basic compilation requires a CUDA toolkit installation and otherwise follows the standard llama.cpp build procedure, with the split mode feature enabled at build time; consult the repository README for the exact flags.

Running inference with split mode enabled involves specifying the number of GPUs and split strategy:

./main -m model.gguf -p "prompt text" -ngl 99 -sm row -mg 2

The -sm row flag activates row-wise splitting across GPUs, while -mg selects which device index acts as the main GPU for intermediate results (the example above designates device 2). Experimenting with the different split strategies (row, layer, none) helps identify optimal configurations for specific models and hardware combinations.
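The practical difference between the strategies can be shown with plain Python. With layer splitting, each device owns whole layers and works while the others idle; with row splitting, every device holds a slice of each weight matrix and contributes to every layer. The two-layer, two-"GPU" model below is a simulation with invented shapes, not real GPU code:

```python
# Toy contrast of layer-split vs row-split matrix-vector products.
# "GPUs" are simulated with list slices; shapes are invented.

def matvec(rows, x):
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

W1 = [[1, 2], [3, 4], [5, 6], [7, 8]]   # layer 1: 2 -> 4
W2 = [[1, 0, 1, 0], [0, 1, 0, 1]]       # layer 2: 4 -> 2
x = [1, 1]

# Layer split: "GPU0" runs layer 1 alone, then "GPU1" runs layer 2 alone.
h = matvec(W1, x)
layer_split_out = matvec(W2, h)

# Row split: each layer's weight rows are halved across both "GPUs",
# partial outputs are concatenated, so both devices work on every layer.
def row_split_matvec(rows, x, n_dev=2):
    half = len(rows) // n_dev
    parts = [matvec(rows[i * half:(i + 1) * half], x) for i in range(n_dev)]
    return [v for part in parts for v in part]

h = row_split_matvec(W1, x)
row_split_out = row_split_matvec(W2, h)

assert row_split_out == layer_split_out  # same math, different parallelism
```

Both paths produce identical outputs; the difference is purely in which device is busy at each step, which is why row splitting can keep all cards active where layer splitting cannot.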

Monitoring GPU utilization through nvidia-smi during inference confirms whether workloads distribute effectively. Balanced usage across cards indicates proper parallelization, while one GPU maxing out suggests configuration adjustments may help.
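This balance check can be scripted against the CSV query output of nvidia-smi (for example, nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits). The sample readings and the 20-point spread threshold below are illustrative choices, not values from ik_llama.cpp:

```python
# Flag imbalanced GPU utilization from nvidia-smi CSV query output.
# Sample readings and the spread threshold are illustrative.

def is_balanced(csv_output, max_spread=20):
    """csv_output: one utilization percentage per line, as produced by
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"""
    utils = [int(line.strip()) for line in csv_output.splitlines() if line.strip()]
    return max(utils) - min(utils) <= max_spread

balanced = is_balanced("93\n89\n91\n")  # all cards busy: good parallelization
lopsided = is_balanced("98\n12\n15\n")  # one card doing nearly all the work
```

Sampling this in a loop during a generation run gives a quick signal that a split strategy or main-GPU choice needs adjusting.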

Context

Standard llama.cpp supports multi-GPU setups but primarily for memory pooling through layer offloading. The -ngl parameter controls how many layers move to GPU memory, and with multiple cards, layers spread across available VRAM. Computation still processes sequentially through layers, limiting performance gains.

Other inference engines like vLLM and TensorRT-LLM offer multi-GPU support with varying approaches. vLLM implements tensor parallelism for serving workloads, while TensorRT-LLM optimizes for NVIDIA hardware specifically. These solutions target production deployments and often require more complex setup than llama.cpp’s straightforward compilation.

The split mode graph approach trades some compatibility for performance. Not all model architectures or quantization schemes benefit equally, and some edge cases may require fallback to standard execution. Testing specific models before committing to hardware purchases remains advisable.

Memory bandwidth between GPUs can bottleneck performance depending on motherboard PCIe lane configurations. Systems with proper x16 slots for each card see better scaling than configurations forcing cards into x8 or x4 modes.
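The impact can be ballparked: PCIe 4.0 offers roughly 2 GB/s per lane, so an x16 slot moves inter-GPU traffic about four times faster than x4. The 16 MB activation-transfer size below is a hypothetical figure chosen for illustration, not a measurement:

```python
# Back-of-envelope PCIe transfer times. The per-lane bandwidth is an
# approximation for PCIe 4.0; the tensor size is a hypothetical example.
PCIE4_PER_LANE_GBS = 2.0   # ~2 GB/s per PCIe 4.0 lane
tensor_mb = 16             # hypothetical per-step activation transfer

def transfer_ms(lanes, megabytes=tensor_mb):
    bandwidth_gbs = lanes * PCIE4_PER_LANE_GBS
    return megabytes / 1024 / bandwidth_gbs * 1000  # milliseconds

x16 = transfer_ms(16)  # ~0.49 ms per transfer
x4  = transfer_ms(4)   # ~1.95 ms per transfer, 4x slower
```

Fractions of a millisecond per transfer add up over thousands of tokens, which is why slot configuration shows up directly in scaling efficiency.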