general

GLM-4.7-Flash hits 2000+ t/s on RTX 6000 Blackwell

A llama.cpp user reports prompt processing above 2000 tokens/sec for GLM-4.7-Flash on NVIDIA's RTX 6000 Blackwell GPU, with generation around 97 tokens/sec.

Someone got GLM-4.7-Flash running at 2000+ tokens/sec for prompt processing on an RTX 6000 Blackwell, which is pretty ridiculous for a 4.7B model.

The trick was using a llama.cpp branch with flash attention support for this model and adding an override flag to fix the MoE gating function. They grabbed the GGUF from https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF and ran it with:

--override-kv deepseek2.expert_gating_func=int:2
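For reference, llama.cpp's `--override-kv` flag takes arguments in `KEY=TYPE:VALUE` form and patches the named GGUF metadata key at load time. A minimal sketch of how such an argument decomposes (an illustrative parser written for this post, not llama.cpp's actual code):

```python
def parse_override_kv(arg: str):
    """Split a llama.cpp-style --override-kv argument of the form KEY=TYPE:VALUE.

    Illustrative only: mirrors the documented CLI syntax, not llama.cpp internals.
    """
    key, _, typed_value = arg.partition("=")
    type_name, _, raw = typed_value.partition(":")
    casters = {"int": int, "float": float, "bool": lambda s: s == "true", "str": str}
    return key, type_name, casters[type_name](raw)

# The flag from the post: force the expert gating function metadata to 2.
print(parse_override_kv("deepseek2.expert_gating_func=int:2"))
```

The override matters because the value lives in the GGUF's metadata; if the file was converted with the wrong value baked in, this flag corrects it without re-quantizing.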

Generation speed hit 97 tokens/sec, and the output quality was surprisingly good for such a small model.

Update: The patch got merged into llama.cpp master and the GGUFs were fixed, so you can just use the regular setup now. Early adopters had to deal with some wonky quants that produced nonsense because they were created with the wrong gating function.
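For context on why the gating function matters: in DeepSeek-V2-style MoE routing, expert logits are turned into mixing weights by either a softmax or a sigmoid before the top-k experts are combined. A sketch of the difference (assuming, as in DeepSeek-V3-style models, that the override selects sigmoid scoring; simplified, without the normalization and scaling real routers apply):

```python
import math

def softmax_gate(logits, top_k):
    # Softmax over all expert logits, then keep the top-k experts.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    return ranked, [probs[i] for i in ranked]

def sigmoid_gate(logits, top_k):
    # Independent sigmoid score per expert, then keep the top-k experts.
    scores = [1 / (1 + math.exp(-x)) for x in logits]
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]
    return ranked, [scores[i] for i in ranked]

logits = [2.0, -1.0, 0.5, 1.5]  # made-up router logits for 4 experts
print(softmax_gate(logits, 2))
print(sigmoid_gate(logits, 2))
```

Since both functions are monotonic, the same experts get selected either way, but the mixing weights differ substantially, which is consistent with the early quants producing degraded output rather than failing outright.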