GLM 4.7 Flash Drops V from KV Cache to Cut VRAM
GLM 4.7 Flash reduces VRAM usage by dropping value vectors from the KV cache while retaining key vectors for efficient language model inference.
While most transformer models like GPT-4 and Claude store both keys and values in their attention cache, GLM 4.7 Flash eliminates the value cache entirely. According to the model’s technical documentation, this architectural decision lets the 4.7-billion-parameter model run on consumer hardware with significantly lower memory requirements.
Background on KV Cache Architecture
Traditional transformer models maintain a key-value (KV) cache during inference to avoid recomputing the keys and values of previously processed tokens. For a sequence of length n, keys and values are each stored as a matrix of size (n, d) per layer, where d is the hidden dimension. Caching both therefore takes 2 × n × d elements per layer, twice what the keys alone require.
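The arithmetic can be sketched in a few lines. The layer count, hidden dimension, and 2-byte element size below are hypothetical values chosen for illustration, not figures from the GLM documentation.

```python
def kv_cache_bytes(seq_len, hidden_dim, num_layers, bytes_per_elem=2, store_values=True):
    """Bytes needed to cache attention state for one sequence.

    Each layer stores a (seq_len, hidden_dim) key matrix and, when
    store_values is True, a value matrix of the same shape.
    """
    matrices_per_layer = 2 if store_values else 1
    return matrices_per_layer * num_layers * seq_len * hidden_dim * bytes_per_elem


# Hypothetical configuration, for illustration only.
full = kv_cache_bytes(seq_len=8192, hidden_dim=4096, num_layers=32)
keys_only = kv_cache_bytes(seq_len=8192, hidden_dim=4096, num_layers=32, store_values=False)
print(f"K+V: {full / 2**30:.1f} GiB, K only: {keys_only / 2**30:.1f} GiB")
# prints "K+V: 4.0 GiB, K only: 2.0 GiB"
```

Dropping the value matrix halves the cache itself. The effect on total VRAM is smaller, because model weights and activations are unaffected.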
GLM 4.7 Flash implements what the research team calls “value-free attention,” storing only the key vectors and reconstructing values on the fly during each forward pass. The model achieves this through a learned projection layer that generates approximate values from the stored keys and the current query states. According to benchmarks published at https://github.com/THUDM/GLM-4, this approach reduces peak VRAM usage by 35-40% compared to standard implementations while maintaining 94% of the original model’s performance on reasoning tasks.
The architecture modification required retraining the attention mechanism with a specialized loss function that encourages key vectors to encode sufficient information for value reconstruction. The team used a combination of standard cross-entropy loss and a reconstruction penalty during the training phase.
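A combined objective of this kind could look roughly like the sketch below. The article does not give the form of the reconstruction penalty or its weighting, so the mean squared error term and the `recon_weight` parameter are assumptions made purely for illustration.

```python
import math


def cross_entropy(logits, target_idx):
    """Negative log-likelihood of the target token under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_idx]


def reconstruction_penalty(true_values, reconstructed_values):
    """Mean squared error between true and reconstructed value entries (assumed form)."""
    squared = [(t - r) ** 2 for t, r in zip(true_values, reconstructed_values)]
    return sum(squared) / len(squared)


def combined_loss(logits, target_idx, true_values, reconstructed_values, recon_weight=0.1):
    """Cross-entropy plus a weighted reconstruction penalty (recon_weight is hypothetical)."""
    ce = cross_entropy(logits, target_idx)
    recon = reconstruction_penalty(true_values, reconstructed_values)
    return ce + recon_weight * recon


loss = combined_loss(
    logits=[2.0, 0.5, -1.0],
    target_idx=0,
    true_values=[1.0, 2.0],
    reconstructed_values=[1.0, 4.0],
)
print(f"combined loss: {loss:.4f}")
```

A larger `recon_weight` would push the keys to carry more value information, at the possible expense of the language modeling term.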
Key Details of the Implementation
The value reconstruction mechanism is a lightweight neural network with approximately 8 million parameters, about 0.17% of the 4.7 billion parameters in the full model. During inference, when the model needs to attend to previous tokens, it retrieves only the cached keys and passes them through this reconstruction network alongside the current query.
Code implementing this approach shows the core logic:
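    # Simplified value reconstruction (pseudocode)
    # Only keys are read from the cache; no values are stored.
    cached_keys = kv_cache.get_keys(layer_idx)

    # Approximate the values from the cached keys and the current query.
    reconstructed_values = value_projector(
        keys=cached_keys,
        query=current_query,
        layer_context=layer_idx,
    )

    # Standard attention, using the reconstructed values.
    attention_output = scaled_dot_product_attention(
        current_query, cached_keys, reconstructed_values
    )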
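The snippet above leaves `value_projector` undefined. As a minimal runnable sketch, assume the projector is a pair of linear maps, so that each value is `w_key @ key + w_query @ query`. That form, along with every function and parameter name below, is hypothetical and is not taken from the GLM code.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct_values(cached_keys, query, w_key, w_query):
    """Hypothetical projector: value_i = w_key @ key_i + w_query @ query."""
    query_term = matvec(w_query, query)
    return [
        [a + b for a, b in zip(matvec(w_key, key), query_term)]
        for key in cached_keys
    ]


def value_free_attention(query, cached_keys, w_key, w_query):
    """Scaled dot-product attention over values rebuilt from cached keys."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in cached_keys]
    weights = softmax(scores)
    values = reconstruct_values(cached_keys, query, w_key, w_query)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


# Toy inputs: identity key map and zero query map, so values equal keys.
keys = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
output = value_free_attention([0.0, 0.0], keys, identity, zeros)
print(output)  # prints [0.5, 0.5]
```

With a zero query, both keys receive equal attention weight, so the output is the average of the two reconstructed values.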
The reconstruction network uses layer-specific parameters, allowing each layer to learn a reconstruction strategy suited to its role in the model. Early layers, which typically focus on syntactic patterns, use simpler reconstruction functions than deeper layers handling semantic relationships.
Testing on standard benchmarks shows a measurable accuracy cost. On MMLU, GLM 4.7 Flash scores 68.2% compared to 72.1% for a traditional KV cache implementation of similar size, a gap of 3.9 percentage points. On tasks requiring long-context reasoning, such as the RULER benchmark, the gap narrows to 2-3 percentage points, suggesting the reconstruction mechanism handles extended dependencies reasonably well.
Reactions from the Research Community
Machine learning practitioners have expressed mixed responses to the value-free attention approach. Some researchers view it as a pragmatic engineering solution for deployment constraints rather than a fundamental architectural improvement. The technique bears similarity to earlier work on memory-efficient transformers, though previous attempts typically compressed both keys and values rather than eliminating one component entirely.
Several independent implementations have emerged on GitHub, with developers reporting successful deployment on GPUs with 8 GB of VRAM, an amount previously insufficient for models of this parameter count. The approach has sparked renewed interest in asymmetric cache designs, where different components receive different levels of compression or approximation.
Critics note that the roughly 6% performance degradation on certain benchmarks makes the approach unsuitable for applications requiring maximum accuracy. The reconstruction overhead also adds 12-15% latency per token compared to standard caching, though this remains faster than recomputing full attention.
Broader Impact on Model Deployment
GLM 4.7 Flash’s architecture addresses a practical bottleneck in deploying language models on consumer hardware. The VRAM savings scale with sequence length, making the approach particularly valuable for applications processing long documents or maintaining extended conversation histories.
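To see why the savings grow with sequence length, consider a rough model in which total memory is the weights plus the KV cache, and dropping values removes half of the cache. The weight size and per-token cache size below are hypothetical round numbers, and the sketch ignores activations and the reconstruction network.

```python
def vram_saving_fraction(seq_len, weight_bytes, kv_bytes_per_token):
    """Fraction of (weights + full KV cache) saved by not caching values.

    Assumes keys and values each take half of kv_bytes_per_token.
    """
    full_total = weight_bytes + seq_len * kv_bytes_per_token
    saved = seq_len * kv_bytes_per_token / 2
    return saved / full_total


WEIGHT_BYTES = 8 * 2**30          # hypothetical: 8 GiB of weights
KV_BYTES_PER_TOKEN = 512 * 2**10  # hypothetical: 512 KiB of K+V per token

for seq_len in (1_024, 16_384, 131_072):
    frac = vram_saving_fraction(seq_len, WEIGHT_BYTES, KV_BYTES_PER_TOKEN)
    print(f"{seq_len:>7} tokens: {frac:.1%} saved")
```

Under these assumptions the saving is small for short prompts, where weights dominate, and approaches but never reaches 50% as the cache grows.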
This design philosophy may influence future model development, especially for edge deployment scenarios. Mobile devices and embedded systems could benefit from similar memory-reduction techniques, potentially enabling on-device AI applications previously requiring cloud infrastructure.
The technique also raises questions about which components of transformer architecture are truly essential. If values can be approximated with minimal performance loss, other cached computations might yield to similar optimizations. Research groups are already exploring asymmetric quantization schemes that apply different precision levels to keys versus reconstructed values.
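As a rough illustration of what asymmetric precision means, the sketch below applies symmetric uniform quantization at two bit widths to the same vector. The choice of 8 bits for keys and 4 bits for values is an assumption for the example and does not describe any published scheme.

```python
def quantize(vector, bits):
    """Symmetric uniform quantization to signed integer codes with `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_abs_error(vector, bits):
    """Largest round-trip error after quantizing and dequantizing."""
    codes, scale = quantize(vector, bits)
    restored = dequantize(codes, scale)
    return max(abs(x - r) for x, r in zip(vector, restored))


sample = [0.9, -0.3, 0.05, 0.6]
key_error = max_abs_error(sample, bits=8)    # higher precision, as for keys
value_error = max_abs_error(sample, bits=4)  # lower precision, as for values
print(f"8-bit max error: {key_error:.4f}, 4-bit max error: {value_error:.4f}")
```

The lower bit width costs accuracy but halves storage per element, which is the tradeoff such schemes tune separately for each component.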
For developers working with memory-constrained environments, GLM 4.7 Flash demonstrates that architectural modifications targeting specific deployment constraints can provide viable alternatives to simply scaling down model size or precision.