GLM-4.7-Flash hits 2000+ t/s on RTX 6000 Blackwell
GLM-4.7-Flash exceeds 2000 tokens per second of prompt processing on NVIDIA's RTX 6000 Blackwell GPU when run with a patched llama.cpp build
Someone got GLM-4.7-Flash running at 2000+ tokens/sec for prompt processing on an RTX 6000 Blackwell, which is pretty ridiculous for a 4.7B model.
The trick was using a special llama.cpp branch with flash attention support and adding a specific override flag. They grabbed the GGUF from https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF and ran it with:
--override-kv deepseek2.expert_gating_func=int:2
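Put together, a full invocation might look like the sketch below. Only the --override-kv flag comes from the post; the binary name, model filename, and the remaining flags are assumptions based on typical llama.cpp usage, so adjust them to your setup.

```shell
# Hypothetical invocation -- only the --override-kv flag is from the post;
# the model filename, layer offload count, and flash-attention flag are
# assumptions and may differ on your build.
./llama-server \
  -m GLM-4.7-Flash-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --flash-attn \
  --override-kv deepseek2.expert_gating_func=int:2
```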
Generation speed hit 97 tokens/sec, and the output quality was surprisingly good for such a small model.
Update: The patch got merged into llama.cpp master and the GGUFs were fixed, so you can just use the regular setup now. Early adopters had to deal with some wonky quants that produced nonsense because they were created with the wrong gating function.
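For context on why a wrong gating function turns output into nonsense: in llama.cpp, the deepseek2.expert_gating_func key selects how the MoE router's logits become expert weights (1 = softmax, 2 = sigmoid, per the llama.cpp enum). The two produce different weights from the same logits, so every layer routes tokens incorrectly if the metadata is wrong. A minimal plain-Python sketch of the difference (illustrative only, not llama.cpp code):

```python
import math

def softmax_gate(logits):
    # Gating func 1 (softmax): weights are coupled and always sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_gate(logits):
    # Gating func 2 (sigmoid): each expert is weighted independently,
    # DeepSeek-V3-style; the weights do not sum to 1.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

router_logits = [2.0, 0.5, -1.0, 0.0]
print(softmax_gate(router_logits))  # sums to 1.0
print(sigmoid_gate(router_logits))  # sums to well over 1.0
```

A quant whose metadata says softmax while the weights were trained for sigmoid routing (or vice versa) scales every expert's contribution wrongly, which matches the garbage output the early quants produced.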