Running 80B Model on AMD Strix Halo with llamacpp
Someone got the new 80B model running on an AMD Strix Halo APU and shared a llama.cpp setup that actually works.
The magic flags that made it smooth: `--flash-attn on --no-mmap`
They're using the llamacpp-rocm b1170 build with a 16k context window. `--flash-attn on` enables flash attention, which speeds up the attention computation, while `--no-mmap` loads the model fully into RAM instead of memory-mapping the file from disk (which reportedly helps with stability on ROCm).
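Put together, a full invocation might look like the sketch below. The model filename, port, and `-ngl` value are placeholders, not details from the original post; `llama-server`, `-c`, `--flash-attn`, `--no-mmap`, and `-ngl` are standard llama.cpp options.

```shell
# Hypothetical llama.cpp server launch on a Strix Halo ROCm build.
# The GGUF filename and -ngl layer count are illustrative -- adjust for your setup.
./llama-server \
  -m ./models/80b-a3b-q4_k_m.gguf \
  -c 16384 \
  --flash-attn on \
  --no-mmap \
  -ngl 99 \
  --port 8080
```

Since Strix Halo uses unified memory shared between the CPU and integrated GPU, offloading all layers with `-ngl` plus `--no-mmap` keeps the whole model resident in that shared RAM rather than paging it from disk.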
Pretty cool to see an 80B-total / 3B-active-parameter MoE model running locally on consumer AMD hardware. Turns out Strix Halo's integrated GPU can handle these sparse models without needing a dedicated card.
Good reference for anyone trying to run larger models on AMD GPUs - those two flags seem to be the key difference between smooth inference and constant crashes.