Running 80B Model on AMD Strix Halo with llamacpp
Someone got the new 80B model running on an AMD Strix Halo APU and shared a llama.cpp setup that actually works.
The magic flags that made it smooth: `--flash-attn on --no-mmap`
They're using the llamacpp-rocm b1170 build with a 16k context window. `--flash-attn on` enables flash attention, which speeds up the attention computation, while `--no-mmap` loads the model fully into RAM instead of memory-mapping the file from disk (which reportedly helps with stability on ROCm).
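Put together, a full invocation might look like the sketch below. The model filename, port, and `-ngl` value are placeholders, not details from the original post; `llama-server`, `-c`, `--flash-attn`, `--no-mmap`, and `-ngl` are standard llama.cpp options.

```shell
# Hypothetical llama.cpp server launch on a Strix Halo ROCm build.
# The GGUF filename and -ngl layer count are illustrative -- adjust for your setup.
./llama-server \
  -m ./models/80b-a3b-q4_k_m.gguf \
  -c 16384 \
  --flash-attn on \
  --no-mmap \
  -ngl 99 \
  --port 8080
```

Since Strix Halo uses unified memory shared between the CPU and integrated GPU, offloading all layers with `-ngl` plus `--no-mmap` keeps the whole model resident in that shared RAM rather than paging it from disk.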
Pretty cool to see an 80B-total / 3B-active-parameter MoE model running locally on consumer AMD hardware. Turns out Strix Halo's integrated GPU can handle these sparse models without needing a dedicated card.
Good reference for anyone trying to run larger models on AMD GPUs - those two flags seem to be the key difference between smooth inference and constant crashes.