Running 80B LLMs Locally on AMD Strix Halo APU
This guide explores running 80-billion-parameter large language models locally on AMD's Strix Halo APU, covering performance, memory requirements, and setup.
Running 80B Models on AMD Strix Halo with llama.cpp
Large language models have traditionally required expensive server hardware or cloud computing resources. A developer wanting to run an 80B-class model locally faced a choice between severely quantized versions that sacrificed quality and an investment of thousands of dollars in NVIDIA GPUs. AMD’s Strix Halo APU changes this equation by supporting up to 128GB of unified memory accessible to both the CPU and the integrated GPU.
Performance Characteristics
The Strix Halo architecture delivers practical inference speeds for 80B parameter models when paired with llama.cpp. Early benchmarks show the APU achieving 8-12 tokens per second with Q4_K_M quantization on an 80B model, sufficient for interactive chat and code generation tasks. This performance stems from the unified memory architecture, which lets the integrated RDNA 3.5 GPU read model weights directly from the shared 128GB pool without PCIe bottlenecks.
Running llama.cpp with the --gpu-layers flag (short form -ngl) set to 40-50 layers produces optimal results. The remaining layers execute on the Zen 5 CPU cores, which handle the workload efficiently thanks to AVX-512 support. A typical command looks like:
./llama-cli -m llama-3-80b-q4_k_m.gguf -n 512 -ngl 45 --ctx-size 4096
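When the command is launched from a script, it can help to assemble the arguments programmatically. The following is a minimal Python sketch, not part of llama.cpp: the helper name build_llama_cli_args is hypothetical, and the binary path and model filename are simply the ones from the command above. It only builds and prints the command line; it does not run anything.

```python
import shlex


def build_llama_cli_args(model_path, gpu_layers, ctx_size=4096, n_predict=512,
                         binary="./llama-cli"):
    """Return the llama-cli argument list for a partial GPU offload.

    Hypothetical helper for illustration; the flags mirror the command
    shown in the text (-m, -n, -ngl, --ctx-size).
    """
    if gpu_layers < 0:
        raise ValueError("gpu_layers must be non-negative")
    return [
        binary,
        "-m", model_path,           # GGUF model file
        "-n", str(n_predict),       # number of tokens to generate
        "-ngl", str(gpu_layers),    # layers offloaded to the integrated GPU
        "--ctx-size", str(ctx_size),
    ]


args = build_llama_cli_args("llama-3-80b-q4_k_m.gguf", gpu_layers=45)
print(shlex.join(args))
```

Keeping the layer count as a parameter makes it easy to sweep values in the 40-50 range and compare throughput on a given machine.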
Memory bandwidth becomes the primary constraint rather than compute power. The LPDDR5X-8000 memory configuration provides approximately 256 GB/s bandwidth, adequate for maintaining consistent token generation without significant stalls.
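The bandwidth figure can be reproduced with simple arithmetic. The sketch below assumes a 256-bit memory bus (an assumption not stated in this guide) and uses a deliberately simplified decode model in which every generated token requires reading a fixed number of gigabytes of weights once. Both function names are illustrative, and the 47.5 GB input is just the midpoint of the 45-50GB file size quoted later.

```python
def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


def decode_ceiling_tokens_per_s(bandwidth_gbs, gb_read_per_token):
    """Upper bound on tokens/s if each token reads gb_read_per_token of weights.

    Simplification: ignores KV cache traffic, caching effects, and compute time.
    """
    return bandwidth_gbs / gb_read_per_token


# LPDDR5X-8000 on an assumed 256-bit bus
bandwidth = peak_bandwidth_gbs(8000, 256)
# Dense model: assume the whole 47.5 GB of weights is read per token
ceiling = decode_ceiling_tokens_per_s(bandwidth, 47.5)

print(f"peak bandwidth: {bandwidth:.0f} GB/s")
print(f"dense decode ceiling: {ceiling:.1f} tokens/s")
```

Under this simplification a dense 47.5 GB model tops out near 5.4 tokens per second. Models that activate only a subset of their weights per token, such as mixture-of-experts designs, read fewer bytes per token and can exceed that figure.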
Architectural Advantages
Strix Halo represents a departure from traditional discrete GPU setups. The APU combines up to 16 Zen 5 CPU cores with up to 40 RDNA 3.5 compute units in a single package, all sharing access to system memory. This unified memory architecture eliminates the need to split model weights between CPU RAM and VRAM, a common limitation when running large models on consumer hardware.
The integrated GPU can be driven through ROCm, AMD’s compute platform, which llama.cpp uses via its HIP backend. llama.cpp also provides a Vulkan backend that does not depend on ROCm. While ROCm support on APUs has historically lagged behind discrete GPUs, AMD has been extending it to the Strix Halo platform. The llama.cpp project identifies releases by build number rather than by semantic version, and recent builds include improved AMD GPU detection and memory management that benefit this architecture.
Power efficiency stands out as another architectural benefit. The entire system draws 65-120W under load, compared to 350W+ for a discrete GPU setup capable of similar performance. This makes Strix Halo viable for always-on local AI assistants or development workstations where power consumption matters.
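The power figures translate directly into energy per generated token. The following Python sketch is a rough estimate only: it combines the 65-120W system draw above with the 8-12 tokens per second quoted earlier, and it assumes power draw is constant during generation. The function name is illustrative.

```python
def kwh_per_million_tokens(power_watts, tokens_per_second):
    """Energy in kWh needed to generate one million tokens at a steady rate."""
    joules_per_token = power_watts / tokens_per_second
    return joules_per_token * 1_000_000 / 3.6e6  # 1 kWh = 3.6e6 J


# Worst case: highest draw, slowest rate. Best case: lowest draw, fastest rate.
worst = kwh_per_million_tokens(120, 8)
best = kwh_per_million_tokens(65, 12)

print(f"worst case: {worst:.2f} kWh per million tokens")
print(f"best case:  {best:.2f} kWh per million tokens")
```

With these inputs the range works out to roughly 1.5 to 4.2 kWh per million generated tokens. Multiplying by a local electricity price gives a running cost that can be compared against per-token cloud pricing.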
Hardware Requirements
A Strix Halo system needs specific components to maximize llama.cpp performance. The 128GB memory configuration is essential for 80B models. While 96GB configurations exist, they leave minimal headroom for the operating system and context cache. LPDDR5X-8000 memory provides the bandwidth necessary to feed both CPU and GPU compute units.
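How much memory the context cache needs depends on the model's attention layout and the context length. The sketch below estimates key/value cache size for a hypothetical architecture with 80 layers, 8 key/value heads, and a head dimension of 128, stored in 16-bit precision. These values are assumptions chosen for illustration, not the parameters of any specific 80B model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes for a given context length.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value


# Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128, fp16 cache
for ctx in (4096, 32768, 131072):
    size_gb = kv_cache_bytes(80, 8, 128, ctx) / 1e9
    print(f"context {ctx:>6} tokens: {size_gb:5.2f} GB")
```

Under these assumptions the cache grows from about 1.3 GB at 4,096 tokens to about 43 GB at 131,072 tokens. Added to a 45-50GB weight file, the long-context case is where a 96GB configuration runs out of room.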
Storage speed impacts model loading times significantly. An NVMe Gen 4 SSD can cut the load time for an 80B Q4 model from about 40 seconds to under 15 seconds. The model files themselves consume 45-50GB, so a 1TB drive provides comfortable working space for multiple models and quantization variants.
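The file size and load times can be sanity-checked with a short calculation. This Python sketch assumes Q4_K_M averages roughly 4.85 bits per weight, an approximation that varies by model, and treats loading as a purely sequential read with no other overhead. The function names are illustrative.

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate model file size in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


def required_read_gbs(size_gb, target_seconds):
    """Sustained read throughput needed to load size_gb in target_seconds."""
    return size_gb / target_seconds


size = gguf_size_gb(80e9, 4.85)        # assumed ~4.85 bits/weight for Q4_K_M
fast = required_read_gbs(size, 15)     # throughput implied by a 15 s load
slow = required_read_gbs(size, 40)     # throughput implied by a 40 s load

print(f"estimated file size: {size:.1f} GB")
print(f"15 s load needs about {fast:.2f} GB/s sustained")
print(f"40 s load needs about {slow:.2f} GB/s sustained")
```

The estimate of about 48.5 GB falls inside the 45-50GB range, and a 15-second load implies a sustained read rate of a little over 3 GB/s.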
Cooling requirements remain modest compared to discrete GPU setups. The APU’s 65W TDP allows standard tower coolers to maintain boost clocks during extended inference sessions. Keeping the APU temperature below 75°C helps maintain consistent performance without thermal throttling.
Alternative Approaches
Running 80B models locally presents several competing options. NVIDIA RTX 4090 systems with 24GB VRAM require model splitting across system RAM, introducing latency from PCIe transfers. The total cost typically exceeds $2,500 for comparable performance.
Cloud inference through providers like Together AI or Replicate costs $0.60-$1.20 per million tokens for 80B models. This becomes expensive for development workflows generating millions of tokens monthly. A Strix Halo system pays for itself within 6-12 months of moderate usage.
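The payback period depends heavily on monthly token volume, so it is worth computing for a specific workload. The following Python sketch uses hypothetical inputs: a hardware cost of 2,000 USD and 300 million tokens per month (counting both prompt and generated tokens) are placeholders, not figures from this guide. Only the per-million-token prices come from the range above. Electricity and maintenance are ignored.

```python
def breakeven_months(hardware_cost_usd, tokens_per_month, usd_per_million_tokens):
    """Months until cumulative cloud spend equals the hardware cost."""
    monthly_cloud_cost = tokens_per_month / 1e6 * usd_per_million_tokens
    if monthly_cloud_cost <= 0:
        raise ValueError("monthly cloud cost must be positive")
    return hardware_cost_usd / monthly_cloud_cost


HARDWARE_COST_USD = 2000          # hypothetical system price
TOKENS_PER_MONTH = 300_000_000    # hypothetical workload, prompt + output tokens

slow_payback = breakeven_months(HARDWARE_COST_USD, TOKENS_PER_MONTH, 0.60)
fast_payback = breakeven_months(HARDWARE_COST_USD, TOKENS_PER_MONTH, 1.20)

print(f"at $0.60 per million tokens: {slow_payback:.1f} months")
print(f"at $1.20 per million tokens: {fast_payback:.1f} months")
```

With these placeholder inputs the payback lands between roughly 5.6 and 11.1 months. Lower token volumes lengthen it proportionally, so substituting real usage numbers matters more than the exact hardware price.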
Smaller quantized models on standard hardware offer another path. A 34B model with Q6 quantization fits in 32GB RAM and runs on conventional desktop hardware. However, the capability gap between 34B and 80B models remains substantial for complex reasoning tasks and code generation.
Apple’s M3 Max with 128GB unified memory provides similar capabilities but locks users into macOS. Strix Halo systems run Linux or Windows, offering greater flexibility for development environments and tool compatibility.
The llama.cpp project is under active development at https://github.com/ggerganov/llama.cpp, with regular optimizations for AMD hardware. Community benchmarks and configuration guides appear in the discussions section, providing practical tuning advice for Strix Halo deployments.