
GLM-4.7-Flash Hits 2000+ Tokens/Sec on RTX 6000

GLM-4.7-Flash achieves over 2000 tokens per second of prompt processing on an NVIDIA RTX 6000 Blackwell GPU, demonstrating how compact language models can deliver exceptional throughput on a single workstation card.

What It Is

GLM-4.7-Flash, a compact 4.7 billion parameter language model, recently demonstrated exceptional performance on NVIDIA’s RTX 6000 Blackwell GPU, achieving over 2000 tokens per second during prompt processing. This benchmark represents a significant milestone for smaller models, showing that efficient architecture and optimized inference can deliver speeds typically associated with high-end server deployments.

The model runs through llama.cpp, a popular C++ inference engine that enables running large language models on consumer hardware. The breakthrough came from combining flash attention support with specific configuration overrides that properly handle the model’s mixture-of-experts architecture. The GGUF format (a quantized model format) allows the model to run efficiently while maintaining output quality, with generation speeds reaching 97 tokens per second.

Why It Matters

This performance level changes the economics of local AI deployment. Organizations and developers working with budget constraints can now achieve throughput that previously required expensive server infrastructure. A 4.7B model hitting 2000+ t/s on a single GPU means real-time applications become feasible without cloud dependencies or multi-GPU setups.

The speed advantage particularly benefits applications requiring rapid context processing - think document analysis, code review tools, or interactive assistants that need to ingest large prompts quickly. While generation speed (97 t/s) matters for user-facing responses, the prompt processing speed determines how fast systems can analyze documents, codebases, or conversation histories.
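To make the trade-off concrete, the two reported speeds can be turned into a rough end-to-end latency estimate. The prompt and response lengths below are illustrative assumptions, not figures from the benchmark:

```shell
# Rough latency estimate from the reported speeds:
# prefill (prompt processing) at 2000 t/s, decode (generation) at 97 t/s.
# Prompt and response lengths are illustrative assumptions.
pp_speed=2000      # prompt-processing tokens/sec
tg_speed=97        # generation tokens/sec
prompt_tokens=8000
gen_tokens=300

awk -v pp="$pp_speed" -v tg="$tg_speed" -v p="$prompt_tokens" -v g="$gen_tokens" \
  'BEGIN { printf "prefill %.1fs + decode %.1fs = %.1fs total\n", p/pp, g/tg, p/pp + g/tg }'
# prints: prefill 4.0s + decode 3.1s = 7.1s total
```

Even a large prompt is absorbed in a few seconds, which is why prompt-processing speed dominates perceived latency in document-analysis workloads.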

Smaller models like GLM-4.7-Flash also reduce operational costs. Lower memory requirements mean more instances can run simultaneously on the same hardware, and reduced power consumption translates to lower electricity bills for teams running continuous inference workloads. The model’s performance demonstrates that architectural innovations can sometimes outweigh raw parameter count.

Getting Started

The setup process has become straightforward after recent updates merged the necessary patches into llama.cpp master. Developers can download the quantized model from https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF and run it with standard llama.cpp commands.
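As a sketch, a typical invocation with a current llama.cpp build might look like the following. The quantization tag, context size, and prompt are assumptions, and flag spellings vary somewhat across builds, so check the repository's published files and your build's help output:

```shell
# Fetch a quantized GGUF from the Hugging Face repo and run it.
# -fa enables flash attention; -ngl 99 offloads all layers to the GPU.
# The quant tag (Q4_K_M) and context size (-c) are illustrative choices.
llama-cli \
  -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M \
  -ngl 99 -fa -c 8192 \
  -p "Summarize the following document: ..."
```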

For those who experimented early, the critical configuration flag was:

--override-kv deepseek2.expert_gating_func=int:2

This override ensures the mixture-of-experts gating function operates correctly. Early GGUF versions had incorrect gating configurations that produced nonsensical output, but current releases include proper metadata. Modern users can simply load the model without manual overrides.
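For older GGUF files that predate the metadata fix, the override is passed directly on the command line. This is a sketch; the model filename is a placeholder:

```shell
# Force the correct MoE gating function on an early GGUF whose
# metadata was wrong. Current GGUF releases do not need this flag.
llama-cli \
  -m GLM-4.7-Flash-Q4_K_M.gguf \
  --override-kv deepseek2.expert_gating_func=int:2 \
  -ngl 99 -fa \
  -p "Hello"
```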

The flash attention support requires a compatible llama.cpp build. Compiling from source with CUDA support enabled provides the best performance on NVIDIA GPUs. The RTX 6000 Blackwell’s architecture particularly benefits from these optimizations, though other recent NVIDIA cards should show substantial improvements as well.
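Building from source follows the standard CMake flow. Assuming a Linux machine with the CUDA toolkit already installed, something like:

```shell
# Clone and build llama.cpp with CUDA support enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Binaries (llama-cli, llama-server, llama-bench) land in build/bin/
```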

Context

GLM-4.7-Flash competes in a crowded space of efficient small models. Alternatives like Phi-3, Gemma 2B, and various Qwen variants offer different trade-offs between size, speed, and capability. The GLM series distinguishes itself through its mixture-of-experts architecture, which activates only relevant parameters for each token, improving efficiency.

Compared to running larger models with aggressive quantization, GLM-4.7-Flash offers more predictable quality. A heavily quantized 13B model might match the speed but often suffers degraded reasoning capabilities. The smaller model’s native design for efficiency means fewer compromises in output coherence.

Limitations remain apparent. While 4.7B parameters suffice for many tasks, complex reasoning, specialized domain knowledge, and nuanced creative writing still favor larger models. The model works best for focused applications - customer support, code completion, structured data extraction - rather than general-purpose assistance requiring broad knowledge.

The Blackwell architecture’s role shouldn’t be overlooked. While the model runs on various hardware, the 2000+ t/s benchmark specifically leverages Blackwell’s improved tensor cores and memory bandwidth. Teams using older GPUs will see lower absolute numbers, though the relative efficiency gains from proper configuration still apply.