GLM 4.7 Flash Drops V from KV Cache to Cut VRAM

GLM 4.7 Flash eliminates the value component from its KV cache during inference, storing only keys to reduce memory usage while keeping transformer attention intact.

What It Is

GLM 4.7 Flash implements an unusual architectural decision in how it handles attention mechanisms during inference. Traditional transformer models store both keys (K) and values (V) in their KV cache - a memory structure that holds previously computed attention states to avoid redundant calculations. GLM 4.7 Flash breaks this pattern by eliminating the value component entirely, operating with only the key cache during generation.

The KV cache typically grows with context length, consuming VRAM proportional to the number of tokens processed. Each token requires storing both a key vector and a value vector across all attention heads and layers. By removing the V component, GLM 4.7 Flash cuts this memory requirement roughly in half for the cache portion of inference.
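The arithmetic behind that halving can be sketched numerically. The dimensions below (32 layers, 32 KV heads, head size 128, fp16) are illustrative assumptions for a 7B-class model, not GLM 4.7 Flash’s published configuration:

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128,
                   bytes_per_elem=2, store_values=True):
    """Bytes of cache for `tokens` tokens: one key vector (and normally
    one value vector) per KV head, per layer, at fp16 precision."""
    components = 2 if store_values else 1  # K and V, or K only
    return tokens * layers * kv_heads * head_dim * bytes_per_elem * components

full = kv_cache_bytes(32_000)                          # K + V
keys_only = kv_cache_bytes(32_000, store_values=False)  # K only
print(f"K+V: {full / 2**30:.1f} GiB, K only: {keys_only / 2**30:.1f} GiB")
# → K+V: 15.6 GiB, K only: 7.8 GiB
```

Whatever the true dimensions, dropping V removes exactly one of the two per-token components, so the cache portion of the savings is a clean factor of two.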

This architectural choice appears specific to how the model’s attention layers were trained and structured. The model reconstructs necessary information from keys alone, though the exact mechanism differs from standard multi-head attention implementations found in models like Llama or Mistral.

Why It Matters

Memory constraints represent the primary bottleneck for running large language models locally. A typical 7B parameter model might use 4-6GB for weights in 4-bit quantization, but the KV cache can balloon to 10-20GB when processing long documents or conversations. This cache growth often forces users to truncate context or upgrade hardware.

Halving cache memory requirements fundamentally changes what’s feasible on consumer GPUs. A setup that previously maxed out at 32,000 tokens might reach 64,000 or even 128,000 tokens with the same VRAM budget. This expansion matters for tasks like analyzing lengthy codebases, processing research papers, or maintaining extended conversation history without summarization.
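The context-doubling claim is simple division. As a sketch, assuming a hypothetical 512 KiB of K+V cache per token and 16 GiB of VRAM left over after model weights:

```python
def max_context(vram_bytes, cache_bytes_per_token):
    """Largest context length whose cache fits in the given budget."""
    return vram_bytes // cache_bytes_per_token

budget = 16 * 2**30              # 16 GiB free for the cache (assumption)
per_token_kv = 512 * 1024        # 512 KiB per token with K+V (assumption)
per_token_k = per_token_kv // 2  # keys only

print(max_context(budget, per_token_kv))  # → 32768 tokens
print(max_context(budget, per_token_k))   # → 65536 tokens
```

The specific per-token figure varies by model shape and quantization, but the ratio does not: half the cache per token means double the tokens per budget.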

The optimization also benefits inference speed. Smaller memory footprints mean less data movement between GPU memory and compute units, potentially improving tokens-per-second throughput. Community reports at https://www.reddit.com/r/LocalLLaMA/comments/1qmvny5/glm47flash_is_even_faster_now/ suggest noticeable performance gains once inference engines properly recognize and exploit the V-less architecture.

Getting Started

Most inference frameworks still allocate memory for both K and V components by default. Checking whether a particular setup respects GLM 4.7 Flash’s architecture requires examining memory usage during generation.

For llama.cpp users, recent builds include optimizations for models that don’t require value caching. Note that llama.cpp renamed its main binary to llama-cli in mid-2024; running with verbose logging shows actual memory allocation:

./llama-cli -m glm-4-flash-7b.gguf -n 512 --verbose

Look for cache allocation messages indicating whether V tensors are being created. Builds from late 2024 onward should automatically detect and skip V allocation for compatible models.

vLLM and other serving frameworks may require explicit configuration. The model’s config.json should indicate its cache structure, though not all tools parse this correctly yet. Testing with progressively longer contexts while monitoring VRAM usage (via nvidia-smi or similar) reveals whether the optimization is active.
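One quick check is to look for cache-related fields in the downloaded config. The snippet below filters a hypothetical config excerpt; num_key_value_heads is a standard Hugging Face field, but the exact keys GLM 4.7 Flash ships are not assumed here:

```python
import json

# Hypothetical excerpt of a model's config.json; real GLM field
# names and values may differ.
cfg = json.loads("""{
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 32,
  "head_dim": 128
}""")

# Surface any fields that hint at the attention/cache structure.
hints = {k: v for k, v in cfg.items()
         if any(s in k for s in ("key_value", "head", "cache"))}
print(hints)
```

If a framework ignores these hints and allocates a conventional K+V cache anyway, VRAM growth during long generations will look identical to a standard model’s.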

The GLM 4.7 Flash model files are available through Hugging Face at https://huggingface.co/THUDM/glm-4-flash-7b, with GGUF quantizations for llama.cpp compatibility.

Context

This approach trades architectural simplicity for memory efficiency. Standard attention mechanisms separate keys and values for good reason: the split governs how information flows through the network, with keys determining where attention lands and values carrying what actually gets passed along. GLM’s design presumably compensates through other architectural elements, though this may limit certain capabilities compared to conventional transformers.

Other models pursue memory efficiency differently. Grouped-query attention (used in Llama 3 and Mistral) reduces cache size by sharing key-value pairs across attention heads. Multi-query attention takes this further by using a single KV head. These approaches maintain the K+V structure while reducing total cache size by 4-8x compared to full multi-head attention.
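The cache-size ratios for these variants fall directly out of the KV head count. A sketch with an assumed 7B-class shape (32 query heads, head size 128, 32 layers, fp16), not any specific model’s published config:

```python
def cache_per_token(kv_heads, head_dim=128, layers=32, bytes_per_elem=2):
    """Per-token cache bytes: a K and a V vector per KV head, per layer."""
    return 2 * kv_heads * head_dim * layers * bytes_per_elem

mha = cache_per_token(kv_heads=32)  # full multi-head: one KV head per query head
gqa = cache_per_token(kv_heads=8)   # grouped-query: heads share KV pairs
mqa = cache_per_token(kv_heads=1)   # multi-query: a single shared KV head

print(mha // gqa, mha // mqa)  # → 4 32
```

With 32 query heads, pure multi-query attention is the extreme case at 32x; real deployments usually pick a grouping that lands in the 4-8x range, which is why both variants still keep the K+V structure.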

GLM 4.7 Flash’s V-less design represents a more aggressive optimization, though it remains model-specific rather than a general technique. Developers working with standard model architectures can’t simply drop V caching without retraining. The approach does demonstrate that transformer attention mechanisms still have room for unconventional optimizations when memory constraints matter more than architectural orthodoxy.