Running 120B AI Models on Networked Mini PCs

An experiment shows how to run 120-billion parameter AI language models on two networked mini PCs using Thunderbolt connections and distributed inference

What It Is

A recent experiment demonstrates how to run massive language models - the kind typically reserved for cloud infrastructure - on consumer hardware by networking two compact PCs together. The setup connects two Bosgame M5 mini PCs, each equipped with AMD’s Strix Halo chips, using Thunderbolt cables to create a distributed inference system.

The technical approach relies on llama.cpp’s Remote Procedure Call (RPC) functionality, which splits model inference across multiple machines. Each M5 unit contributes 256GB of RAM and integrated GPU resources, creating a combined pool of 512GB of memory. A single machine runs models like GPT-OSS-120B at over 50 tokens per second, while the networked pair handles Minimax-M2.1 Q6 at 18 tokens per second.

The total hardware investment sits around €3,200 for both systems plus USB4 cables - a fraction of what enterprise GPU clusters cost. The Strix Halo chips provide both substantial memory bandwidth and integrated graphics processing without requiring discrete GPUs.

Why It Matters

This approach challenges the assumption that running frontier-scale models requires either expensive cloud subscriptions or dedicated server hardware. Research teams, independent developers, and small organizations can now experiment with 100B+ parameter models using equipment that fits on a desk.

The economics shift significantly when comparing ongoing cloud costs versus one-time hardware purchases. Organizations running continuous inference workloads or fine-tuning experiments may find the payback period surprisingly short. Privacy-conscious applications also benefit, since all processing happens locally without sending data to external APIs.
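To make the payback intuition concrete, here is a hypothetical back-of-envelope calculation. Only the roughly €3,200 hardware figure comes from this write-up; the per-token cloud price and the monthly token volume are illustrative assumptions, not quoted rates:

```shell
# Hypothetical break-even sketch: the 3200 EUR hardware figure is from the
# article; the cloud price and token volume below are assumptions.
breakeven=$(awk 'BEGIN {
  hardware  = 3200   # EUR for both units plus cables (from the article)
  eur_per_m = 2.0    # assumed blended price per 1M tokens
  tokens_m  = 200    # assumed continuous workload: 200M tokens/month
  printf "%.1f", hardware / (tokens_m * eur_per_m)
}')
echo "$breakeven months to break even"
```

Under these assumed numbers the hardware pays for itself in well under a year; lighter workloads stretch the payback period proportionally.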

The networking aspect opens possibilities for incremental scaling. Rather than replacing entire systems, developers can add capacity by connecting additional units as needs grow. This modular approach suits projects with uncertain resource requirements or budget constraints.

However, the current implementation reveals important limitations. Prompt preprocessing performance lags considerably, creating noticeable delays before inference begins. This makes the setup less suitable for interactive applications requiring instant responses, though batch processing workloads remain viable.

Getting Started

Setting up distributed inference requires configuring llama.cpp with RPC support across networked machines. The basic workflow involves:

First, build llama.cpp on both systems with the RPC backend enabled. With the current CMake-based build, this means setting the GGML_RPC flag rather than running a bare make:

cmake -B build -DGGML_RPC=ON
cmake --build build --config Release -j

Connect the mini PCs using USB4/Thunderbolt cables capable of handling the bandwidth requirements. Configure one machine as the RPC server and the other as the client, specifying network addresses and ports in the llama.cpp configuration.
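Assuming a build with the RPC backend enabled, the server/client split might look like the following sketch. The link-local address and port are placeholders for whatever the Thunderbolt network interface reports, not values from the write-up:

```shell
# On the machine acting as RPC server: expose the ggml RPC backend
# on the Thunderbolt network interface (address and port are examples).
./build/bin/rpc-server -H 169.254.0.2 -p 50052

# On the client machine: llama-cli accepts a comma-separated list of
# remote endpoints via --rpc and treats them as additional backends.
./build/bin/llama-cli --rpc 169.254.0.2:50052 -m models/model.gguf -p "Hello"
```

The same `--rpc` flag works with llama-server for serving an API instead of an interactive session.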

Load model weights distributed across both systems, ensuring memory allocation balances appropriately. The RPC layer handles communication between machines during inference, though optimal performance requires tuning buffer sizes and batch parameters.
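A hedged sketch of the balancing and tuning knobs described above, using llama.cpp's standard flags; the values shown are starting points to experiment with, not recommendations from the write-up:

```shell
# -ngl offloads layers to the available (local plus remote) backends,
# --tensor-split balances the weights across them, and the batch /
# micro-batch sizes trade latency against RPC round-trips.
./build/bin/llama-cli \
  --rpc 169.254.0.2:50052 \
  -m models/model.gguf \
  -ngl 99 \
  --tensor-split 1,1 \
  -b 2048 -ub 512 \
  -p "Hello"
```

An even 1,1 split assumes both machines contribute equal memory; skew the ratio if one side also hosts the KV cache or other workloads.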

Detailed setup guides and community troubleshooting resources are available at https://strixhalo.wiki, where builders share configuration files and optimization tips. The Discord community provides real-time assistance for hardware compatibility issues and performance tuning.

Context

Traditional approaches to running large models typically involve either cloud services like OpenAI’s API or local deployment on high-end GPUs. Cloud solutions offer convenience but accumulate costs quickly with heavy usage. Single-GPU setups hit memory limits around 24-48GB, restricting model sizes to roughly 30B parameters with quantization.
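A rough weights-only estimate shows why those limits fall where they do. The ~4.5 bits per weight figure is an assumption typical of mid-range GGUF quantizations, and real usage is higher once the KV cache and runtime overhead are added:

```shell
# Weights-only memory estimate in GB at an assumed ~4.5 bits/weight;
# ignores KV cache and overhead, which push real usage higher.
gb_30b=$(awk 'BEGIN { printf "%.1f", 30e9 * 4.5 / 8 / 1e9 }')
gb_120b=$(awk 'BEGIN { printf "%.1f", 120e9 * 4.5 / 8 / 1e9 }')
echo "30B: ${gb_30b} GB, 120B: ${gb_120b} GB"
```

At roughly 17GB of weights, a 30B model fits a 24GB card with little room for context, while a 120B model's ~67GB of weights is what pushes it into pooled-RAM territory.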

The networked mini PC approach occupies a middle ground - higher upfront cost than cloud subscriptions but lower than professional GPU workstations. Compared to a single RTX 4090 system, the dual M5 setup provides dramatically more memory at similar total cost.

Performance characteristics differ from GPU-based inference. The integrated graphics and system memory architecture trades raw speed for capacity. Applications requiring maximum throughput still benefit from dedicated accelerators, while memory-bound workloads find the expanded RAM pool advantageous.

Future testing with vLLM and 345B parameter models will reveal whether this architecture scales effectively to even larger models. The prompt processing bottleneck remains the primary concern for production deployments, though ongoing llama.cpp optimizations may address this limitation.