KimiLinear MLA Cache Cuts 1M Context to 14.9GB VRAM

Kimi Linear's MLA-style KV cache cuts the memory needed for a one-million-token context window to just 14.9GB of VRAM at f16

Someone got KimiLinear working with proper MLA KV cache support, which is a game changer for running long context locally.

The implementation cuts the 1M-token KV cache from 140GB down to just 14.875GB at f16, roughly a 9.4x reduction. That means massive contexts fit on GPUs with far less VRAM.
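A quick back-of-envelope check on those numbers (a sketch, assuming the figures are GiB and "1M" means 2^20 tokens; neither unit is stated in the post):

```python
# Sanity-check the reported KV cache sizes.
# Assumptions (not from the post): sizes are GiB, 1M tokens = 2**20.
FULL_CACHE_GIB = 140.0   # reported full KV cache at 1M tokens
MLA_CACHE_GIB = 14.875   # reported MLA cache at 1M tokens, f16
TOKENS = 2**20

# Per-token footprint of the MLA cache: GiB -> bytes, divided by token count.
bytes_per_token = MLA_CACHE_GIB * 1024**3 / TOKENS
print(f"{bytes_per_token:.0f} bytes/token")           # 15232 bytes/token
print(f"{bytes_per_token / 2:.0f} f16 values/token")  # 7616 values/token
print(f"{FULL_CACHE_GIB / MLA_CACHE_GIB:.1f}x reduction")  # 9.4x
```

So the MLA cache stores only about 7.6K f16 values per token, which is what makes the million-token window practical on a single consumer GPU.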

To try it:

Grab the model from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF
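Then load it with llama.cpp. The invocation below is a hypothetical sketch: the GGUF filename is a placeholder for whichever quant you download, and you may need a recent llama.cpp build for the MLA cache support to kick in.

```shell
# Hypothetical example: serve the model with a 1M-token context window.
# --ctx-size sets context length; --cache-type-k / --cache-type-v pick
# the KV cache quantization type (f16, q8_0, q4_0, ...).
# The filename is a placeholder; use whichever quant you grabbed.
llama-server \
  -m Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --ctx-size 1048576 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Note that quantizing the V cache in llama.cpp has historically required flash attention to be enabled, so check your build's flags if the cache-type options are rejected.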

The cool part is adjustable KV cache quantization: q4_0 drops the 1M-token cache to 4.184GB, and q8_0 sits at 7.902GB. Pretty useful if VRAM is tight.
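Those quant sizes line up with the standard GGUF block layouts (a sketch; the block formats below are llama.cpp's q8_0/q4_0 definitions, and the 14.875GB f16 baseline is the figure from the post):

```python
# Derive the quantized cache sizes from the f16 baseline using
# GGUF block layouts (32 values per block):
#   f16:  2 bytes/value
#   q8_0: 32 int8 values + 2-byte f16 scale = 34 bytes -> 1.0625 bytes/value
#   q4_0: 32 4-bit values (16 bytes) + 2-byte scale = 18 bytes -> 0.5625 bytes/value
F16_CACHE_GB = 14.875  # reported 1M-token cache size at f16

bytes_per_value = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

for qtype, bpv in bytes_per_value.items():
    gb = F16_CACHE_GB * bpv / bytes_per_value["f16"]
    print(f"{qtype}: {gb:.3f} GB")
# f16: 14.875 GB, q8_0: 7.902 GB, q4_0: 4.184 GB
```

The derived numbers match the reported 7.902GB and 4.184GB exactly, so the cache quants really are just the usual GGUF block formats applied to the MLA cache.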

KimiLinear used to top the ContextArena leaderboard (https://contextarena.ai/) before being deprecated for unclear reasons. It still works, though, and handles long context surprisingly well for local inference.