KimiLinear MLA Cache Cuts 1M Context to 14.9GB VRAM

Kimi Linear's MLA-style KV cache cuts the memory needed for a one-million-token context window to just 14.9GB of VRAM at f16

Someone got KimiLinear working with proper MLA KV cache support, which is a game changer for running long context locally.

The implementation cuts the 1M-token KV cache from 140GB down to just 14.875GB at f16, roughly a 9.4x reduction. That means massive contexts fit on GPUs with far less VRAM.
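A quick back-of-envelope check on those numbers (a sketch, assuming the figures are GiB and "1M" means 2^20 tokens; neither unit is stated in the post):

```python
# Sanity-check the reported KV cache sizes.
# Assumptions (not from the post): sizes are GiB, 1M tokens = 2**20.
FULL_CACHE_GIB = 140.0   # reported full KV cache at 1M tokens
MLA_CACHE_GIB = 14.875   # reported MLA cache at 1M tokens, f16
TOKENS = 2**20

# Per-token footprint of the MLA cache: GiB -> bytes, divided by token count.
bytes_per_token = MLA_CACHE_GIB * 1024**3 / TOKENS
print(f"{bytes_per_token:.0f} bytes/token")           # 15232 bytes/token
print(f"{bytes_per_token / 2:.0f} f16 values/token")  # 7616 values/token
print(f"{FULL_CACHE_GIB / MLA_CACHE_GIB:.1f}x reduction")  # 9.4x
```

So the MLA cache stores only about 7.6K f16 values per token, which is what makes the million-token window practical on a single consumer GPU.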

To try it:

Grab the model from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF
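Then load it with llama.cpp. The invocation below is a hypothetical sketch: the GGUF filename is a placeholder for whichever quant you download, and you may need a recent llama.cpp build for the MLA cache support to kick in.

```shell
# Hypothetical example: serve the model with a 1M-token context window.
# --ctx-size sets context length; --cache-type-k / --cache-type-v pick
# the KV cache quantization type (f16, q8_0, q4_0, ...).
# The filename is a placeholder; use whichever quant you grabbed.
llama-server \
  -m Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --ctx-size 1048576 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Note that quantizing the V cache in llama.cpp has historically required flash attention to be enabled, so check your build's flags if the cache-type options are rejected.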

The cool part is adjustable KV cache quantization: q4_0 drops the 1M-token cache to 4.184GB, and q8_0 sits at 7.902GB. Pretty useful if VRAM is tight.
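Those quant sizes line up with the standard GGUF block layouts (a sketch; the block formats below are llama.cpp's q8_0/q4_0 definitions, and the 14.875GB f16 baseline is the figure from the post):

```python
# Derive the quantized cache sizes from the f16 baseline using
# GGUF block layouts (32 values per block):
#   f16:  2 bytes/value
#   q8_0: 32 int8 values + 2-byte f16 scale = 34 bytes -> 1.0625 bytes/value
#   q4_0: 32 4-bit values (16 bytes) + 2-byte scale = 18 bytes -> 0.5625 bytes/value
F16_CACHE_GB = 14.875  # reported 1M-token cache size at f16

bytes_per_value = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

for qtype, bpv in bytes_per_value.items():
    gb = F16_CACHE_GB * bpv / bytes_per_value["f16"]
    print(f"{qtype}: {gb:.3f} GB")
# f16: 14.875 GB, q8_0: 7.902 GB, q4_0: 4.184 GB
```

The derived numbers match the reported 7.902GB and 4.184GB exactly, so the cache quants really are just the usual GGUF block formats applied to the MLA cache.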

KimiLinear used to top the ContextArena leaderboard (https://contextarena.ai/) before being deprecated for unclear reasons. It still works, though, and handles long context surprisingly well for local inference.