KimiLinear MLA Cache Cuts 1M Context to 14.9GB VRAM
Kimi's linear MLA cache architecture reduces the memory needed for a one-million-token context window to just 14.9GB of VRAM through an efficient attention cache design.
Someone got KimiLinear working with proper MLA KV cache support, which is kind of a game changer for running long context locally.
The implementation cuts the 1M-token KV cache from 140GB down to just 14.875GB at f16, roughly a 9.4x reduction. That means running massive contexts on GPUs with way less VRAM.
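Those headline numbers hold up to quick arithmetic: since 2^20 tokens is about 1M, a cache quoted in GiB maps directly to KiB per token. A minimal back-of-the-envelope check (assuming the quoted sizes are binary GiB, which the post doesn't specify):

```python
# Sanity check on the claimed KV cache savings.
# Assumption: quoted sizes are GiB and "1M tokens" means 2**20.
tokens = 2**20
f16_cache_bytes = 14.875 * 2**30   # quoted f16 cache for 1M tokens
old_cache_bytes = 140 * 2**30      # pre-MLA cache size from the post

per_token_kib = f16_cache_bytes / tokens / 2**10
reduction = old_cache_bytes / f16_cache_bytes

print(f"{per_token_kib:.3f} KiB per token")  # ~14.875 KiB/token
print(f"{reduction:.1f}x smaller")           # ~9.4x
```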
To try it:
Grab the model from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF
The cool part is adjustable KV cache quantization: q8_0 brings it down to 7.902GB, and q4_0 drops it to 4.184GB. Pretty useful if VRAM is tight.
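Those quant sizes line up with GGML's block formats: q8_0 stores 34 bytes per 32 values and q4_0 stores 18 bytes per 32 values, versus 2 bytes per value for f16. A quick sketch deriving them from the f16 baseline (the bytes-per-block figures are the standard GGML layouts, not stated in the post):

```python
# Derive the quantized KV cache sizes from the f16 baseline using
# GGML block layouts; each block covers 32 values.
BYTES_PER_VALUE = {
    "f16":  2.0,      # plain half precision
    "q8_0": 34 / 32,  # 32x int8 + one f16 scale per block
    "q4_0": 18 / 32,  # 32x 4-bit + one f16 scale per block
}

f16_cache_gb = 14.875  # quoted 1M-token cache at f16

for fmt, bpv in BYTES_PER_VALUE.items():
    size = f16_cache_gb * bpv / BYTES_PER_VALUE["f16"]
    print(f"{fmt}: {size:.3f} GB")
# f16: 14.875 GB, q8_0: 7.902 GB, q4_0: 4.184 GB -- matching the post
```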
KimiLinear used to top the ContextArena leaderboard (https://contextarena.ai/) before being deprecated there for unclear reasons. It still works, though, and handles long context surprisingly well for local inference.