Running 120B AI Models on Networked Mini PCs
Researchers demonstrate running 120-billion-parameter AI models across networked mini PCs using distributed computing techniques, making large language models practical to run outside the data center.
Someone figured out how to run massive AI models by networking two Bosgame M5 PCs (Strix Halo chips) via Thunderbolt cables.
The setup uses llama.cpp’s RPC feature to split model inference across both machines. With 512GB total RAM and dual iGPUs, they’re running models like:
- GPT-OSS-120B at 50+ tokens/s (single PC)
- Minimax-M2.1 Q6 at 18 tokens/s (networked)
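The networked setup above relies on llama.cpp's RPC backend, which lets one machine offload layers to a `rpc-server` process running on another. A minimal sketch of the two-machine flow is below; the IP address, model filename, and exact flag spellings are assumptions and may vary with your llama.cpp build (the RPC backend must be compiled in, e.g. with `-DGGML_RPC=ON`):

```shell
# On the second machine (the worker), start the RPC backend
# and let it listen on the Thunderbolt network interface:
rpc-server --host 0.0.0.0 --port 50052

# On the first machine, run inference and point llama.cpp at the
# worker over the link (address and model name are placeholders):
llama-cli -m minimax-m2.1-q6.gguf \
  --rpc 169.254.0.2:50052 \
  -ngl 99 \
  -p "Hello"
```

`llama-server` accepts the same `--rpc` flag, so the distributed setup also works behind the usual OpenAI-compatible HTTP endpoint.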
Total cost was around €3,200 for both systems plus USB4 cables.
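The price makes more sense with some back-of-envelope memory math: a quantized model's footprint is roughly parameters × bits-per-weight ÷ 8, ignoring KV cache and runtime overhead. The bits-per-weight figures below are assumed, illustrative values, not measurements from the setup described:

```python
def quant_size_gb(n_params_billions: float, bits_per_weight: float) -> float:
    """Approximate RAM footprint of a quantized model in GB:
    parameters * bits-per-weight / 8 (KV cache and overhead excluded)."""
    return n_params_billions * bits_per_weight / 8

# 120B parameters at an assumed ~4.5 bits/weight (4-bit-class quant):
print(quant_size_gb(120, 4.5))  # → 67.5 (GB) — fits on a single box
# 345B parameters at an assumed ~6.5 bits/weight (Q6-class quant):
print(quant_size_gb(345, 6.5))  # → 280.3125 (GB) — needs both machines
```

This is why the 345B experiments they have planned would require the pooled RAM of both systems rather than one.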
Getting started: Check out the Strix Halo wiki for setup guides and join their Discord for troubleshooting.
The catch? Prompt processing (the prefill phase) is painfully slow right now, though generation speed is solid once it gets going. They’re planning to test vLLM with 345B models next, which could be interesting for anyone tired of cloud API costs.
Pretty wild that consumer hardware can handle models this size without melting into a puddle.