general

GLM 4.7 Flash Skips V in KV Cache, Saves VRAM

GLM 4.7 Flash introduces a novel architecture that eliminates the value cache in key-value attention, significantly reducing VRAM usage while maintaining model quality.

Someone found that GLM 4.7 Flash has a quirky optimization: it doesn't actually use the V (value) part of its KV cache at all. Attention in this model only needs the K (key) component to work.

This matters because KV cache normally eats tons of VRAM during long conversations. By skipping V entirely, GLM saves gigabytes when handling longer contexts. People running it locally can push way further on the same hardware.
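As a rough back-of-the-envelope, dropping V halves the attention cache, since K and V tensors are normally the same size. A minimal sketch of the math (the layer/head/dim numbers below are made-up placeholders, not GLM 4.7 Flash's actual configuration):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   dtype_bytes=2, include_v=True):
    """Estimate attention-cache size: one K tensor per layer,
    plus an equally sized V tensor unless include_v is False."""
    per_tensor = layers * kv_heads * head_dim * seq_len * dtype_bytes
    return per_tensor * (2 if include_v else 1)

# Hypothetical config (NOT GLM 4.7 Flash's real dimensions),
# fp16 cache (2 bytes per element), 128k context:
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128 * 1024)

full = kv_cache_bytes(**cfg)                     # K + V cached
k_only = kv_cache_bytes(**cfg, include_v=False)  # K only
print(f"K+V: {full / 2**30:.1f} GiB, K-only: {k_only / 2**30:.1f} GiB")
# With these placeholder numbers: 16.0 GiB vs 8.0 GiB
```

Whatever the real dimensions turn out to be, skipping V cuts the cache in half, which is where the extra context headroom on the same card comes from.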

The catch is that most inference engines still allocate VRAM for both K and V by default, even though V sits empty. Someone posted an update about getting even better speed once tools started respecting this: https://www.reddit.com/r/LocalLLaMA/comments/1qmvny5/glm47flash_is_even_faster_now/

Worth checking if your setup supports V-less caching. Could mean the difference between 32k and 128k context on a mid-range GPU.