GLM-4.7-Flash hits 2000+ t/s on RTX 6000 Blackwell
GLM-4.7-Flash exceeds 2000 tokens per second of prompt processing on NVIDIA's RTX 6000 Blackwell GPU when run with a patched llama.cpp build
Someone got GLM-4.7-Flash running at 2000+ tokens/sec for prompt processing on an RTX 6000 Blackwell, which is pretty ridiculous for a 4.7B model.
The trick was using a special llama.cpp branch with flash attention support and adding a specific override flag. They grabbed the GGUF from https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF and ran it with:
--override-kv deepseek2.expert_gating_func=int:2
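Put together, a full invocation might look like the sketch below. Only the --override-kv flag comes from the post; the binary name, model filename, and the remaining flags are assumptions based on typical llama.cpp usage, so adjust them to your setup.

```shell
# Hypothetical invocation -- only the --override-kv flag is from the post;
# the model filename, layer offload count, and flash-attention flag are
# assumptions and may differ on your build.
./llama-server \
  -m GLM-4.7-Flash-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --flash-attn \
  --override-kv deepseek2.expert_gating_func=int:2
```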
Generation speed hit 97 tokens/sec, and the output quality was surprisingly good for such a small model.
Update: The patch got merged into llama.cpp master and the GGUFs were fixed, so you can just use the regular setup now. Early adopters had to deal with some wonky quants that produced nonsense because they were created with the wrong gating function.
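For context on why a wrong gating function turns output into nonsense: in llama.cpp, the deepseek2.expert_gating_func key selects how the MoE router's logits become expert weights (1 = softmax, 2 = sigmoid, per the llama.cpp enum). The two produce different weights from the same logits, so every layer routes tokens incorrectly if the metadata is wrong. A minimal plain-Python sketch of the difference (illustrative only, not llama.cpp code):

```python
import math

def softmax_gate(logits):
    # Gating func 1 (softmax): weights are coupled and always sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_gate(logits):
    # Gating func 2 (sigmoid): each expert is weighted independently,
    # DeepSeek-V3-style; the weights do not sum to 1.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

router_logits = [2.0, 0.5, -1.0, 0.0]
print(softmax_gate(router_logits))  # sums to 1.0
print(sigmoid_gate(router_logits))  # sums to well over 1.0
```

A quant whose metadata says softmax while the weights were trained for sigmoid routing (or vice versa) scales every expert's contribution wrongly, which matches the garbage output the early quants produced.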