Running 80B Model on AMD Strix Halo with llamacpp

This article explores running an 80-billion-parameter language model on AMD's Strix Halo APU using llama.cpp, demonstrating local AI inference on consumer hardware.

Someone got the new 80B model running on AMD Strix Halo and shared the llama.cpp setup that actually works.

The magic flags that made it smooth:

--flash-attn on --no-mmap

They're using llamacpp-rocm build b1170 with a 16k context window. --flash-attn on enables flash-attention kernels, which speed up attention computation; --no-mmap loads the weights fully into RAM instead of memory-mapping the file, which reportedly helps with stability on ROCm.
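Put together, the invocation would look something like the sketch below. The post only names the two flags and the 16k context; the binary choice, model path, layer count, and quant are illustrative assumptions, not details from the post.

```shell
# Sketch: serve an 80B GGUF with the flags from the post (llamacpp-rocm b1170).
# Model path, quant, and -ngl value are illustrative assumptions.
./llama-server \
  -m ./models/model-80b-q4.gguf \
  -c 16384 \
  --flash-attn on \
  --no-mmap

# -c 16384        : the 16k context mentioned in the post
# --flash-attn on : flash-attention kernels for faster attention
# --no-mmap       : load weights fully into RAM (reportedly more stable on ROCm)
```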

Pretty cool to see the 80B-total/3B-active-parameter setup running locally on consumer AMD hardware. Turns out the Strix Halo's integrated GPU can handle these sparse mixture-of-experts models without needing a dedicated card.
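For a rough sense of why this fits at all, here's a back-of-envelope weight-size estimate. The 4-bit quantization level is my assumption for illustration; the post doesn't name the quant.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate quantized weight footprint in decimal GB.

    Ignores KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9

# 80B total parameters at a hypothetical 4-bit quant: ~40 GB of weights,
# which is why this leans on the APU's large unified memory pool rather
# than a typical discrete card's VRAM. Only ~3B parameters are active per
# token, which is what keeps inference speed reasonable.
print(f"{model_size_gb(80e9, 4):.0f} GB")  # → 40 GB
```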

A good reference for anyone trying to run larger models on AMD GPUs - those two flags seem to be the key difference between smooth inference and constant crashes.