TurboQuant: 4.6x KV Cache Compression for Apple Silicon

What It Is

TurboQuant brings Google’s KV cache compression technique to Apple Silicon through custom Metal kernels optimized for MLX. The implementation achieves 4.6x compression of the key-value cache while maintaining 98% of FP16 inference speed on models like Qwen2.5-32B running on M4 Pro hardware.

KV cache stores previously computed attention keys and values during text generation, allowing models to avoid recomputing them for each new token. This cache grows linearly with context length and can consume several gigabytes of memory. TurboQuant compresses this cache through quantization - converting the high-precision floating-point values to lower-precision representations - but does so in a way that preserves model quality while dramatically reducing memory footprint.
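To make the memory math concrete, here is a minimal per-row int8 quantizer for cached keys. This is an illustrative sketch only, not TurboQuant's actual scheme (which is more sophisticated): it shows the basic trade of 16-bit values for 8-bit values plus a small per-row scale, and why reconstruction error stays bounded by that scale.

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    """Quantize float values to int8 with one scale per row.

    Illustrative only: each value costs 8 bits instead of 16,
    plus a single FP16 scale per row of the cache.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_kv(q: np.ndarray, scale: np.ndarray):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 128)).astype(np.float32)
q, s = quantize_kv(keys)
recon = dequantize_kv(q, s)
# per-element error is bounded by half the row scale
print(np.abs(keys - recon).max())
```

The worst-case error per element is half the row's scale, which is why quality holds up much better than naive truncation to fewer bits.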

The breakthrough here involves fusing quantization and dequantization operations directly into Metal compute kernels rather than using standard MLX operations. This architectural choice transforms what would otherwise be a significant performance penalty into nearly imperceptible overhead.
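A toy NumPy sketch (my illustration, not the actual Metal code) shows why fusion matters: the unfused path materializes a full-precision copy of the entire cache before the attention matmul, while the fused path dequantizes one tile at a time inside the loop, so the FP16 cache never exists in memory at once.

```python
import numpy as np

def attn_scores_unfused(q_vec, k_q, k_scale):
    # Materializes the whole dequantized cache first: an extra
    # pass over memory and a large temporary buffer.
    k_full = k_q.astype(np.float32) * k_scale
    return k_full @ q_vec

def attn_scores_fused(q_vec, k_q, k_scale, tile=256):
    # Dequantizes tile-by-tile inside the score loop, mimicking
    # what a fused kernel does in on-chip registers/threadgroup memory.
    n = k_q.shape[0]
    out = np.empty(n, dtype=np.float32)
    for i in range(0, n, tile):
        blk = k_q[i:i+tile].astype(np.float32) * k_scale[i:i+tile]
        out[i:i+tile] = blk @ q_vec
    return out

rng = np.random.default_rng(0)
k_q = rng.integers(-127, 128, size=(1024, 64), dtype=np.int8)
k_scale = rng.random((1024, 1), dtype=np.float32) * 0.02
q_vec = rng.standard_normal(64).astype(np.float32)
scores_a = attn_scores_unfused(q_vec, k_q, k_scale)
scores_b = attn_scores_fused(q_vec, k_q, k_scale)
```

On a GPU the fused version additionally avoids a round trip through device memory, which is where most of the unfused penalty comes from.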

Why It Matters

Memory constraints represent one of the primary bottlenecks when running large language models on consumer hardware. A 16K context window that typically requires 4.2GB of KV cache drops to just 897MB with TurboQuant - a reduction that makes the difference between a model fitting in available memory or failing to run entirely.
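The headline figures can be sanity-checked with back-of-the-envelope arithmetic. The model configuration below (64 layers, 8 KV heads via grouped-query attention, head dimension 128) is my assumption for Qwen2.5-32B, not stated in the article, and the article's 4.2GB figure may use decimal gigabytes, so agreement is approximate:

```python
layers, kv_heads, head_dim = 64, 8, 128   # assumed Qwen2.5-32B config
context, fp16_bytes = 16_384, 2

# keys + values: one entry per layer, KV head, position, and dim
fp16_cache = 2 * layers * kv_heads * head_dim * context * fp16_bytes
print(f"FP16 cache: {fp16_cache / 2**30:.1f} GiB")        # ~4 GiB
print(f"At 4.6x:    {fp16_cache / 4.6 / 2**20:.0f} MiB")  # ~890 MiB
```

The rough agreement with the article's 4.2GB and 897MB numbers suggests the reported ratio is an end-to-end measurement of the whole cache, scales included.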

Mac users working with local LLMs gain immediate practical benefits. Longer context windows become feasible on machines with limited unified memory. Models that previously required 64GB or 96GB configurations can now run on 32GB systems. This democratizes access to larger models and longer conversations without requiring hardware upgrades.

The performance characteristics matter equally. Early quantization approaches often traded memory savings for substantial speed penalties. TurboQuant’s 0.98x speed ratio means developers can enable compression by default without meaningful impact on user experience. The technique maintains output quality while delivering these gains, avoiding the accuracy degradation that plagued earlier compression methods.

For the broader MLX ecosystem, this implementation demonstrates how platform-specific optimizations can unlock capabilities that generic cross-platform approaches miss. Metal’s tight integration with Apple Silicon enables optimizations impossible on other hardware.

Getting Started

The implementation lives at https://github.com/arozanov/turboquant-mlx and integrates with existing MLX workflows. Developers can clone the repository and experiment with the compression technique.

A pull request to merge this functionality into mlx-lm is pending at https://github.com/ml-explore/mlx-lm/pull/1067. If merged, the feature would become available to all mlx-lm users without requiring separate installation.

The developer documented the complete optimization process, including benchmark methodology and performance analysis, at https://medium.com/@antonrozanov/turboquant-on-mlx-4-6x-kv-cache-compression-with-custom-metal-kernels-9cdee3f7d2a2. This writeup explains the journey from an initial 0.28x speed implementation to the final fused kernel approach.

Testing requires an Apple Silicon Mac - M1, M2, M3, or M4 series chips all support Metal compute. The technique works with various model architectures, though benchmarks focus on Qwen2.5-32B as a representative large model.

Context

TurboQuant joins several approaches to managing KV cache memory. PagedAttention, used in vLLM, reorganizes cache storage to reduce fragmentation. Multi-Query Attention and Grouped-Query Attention reduce cache size by sharing keys and values across attention heads. Each technique offers different tradeoffs between memory, speed, and implementation complexity.
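For scale, GQA's savings come purely from head sharing, and quantization stacks multiplicatively on top. The 40-query-head / 8-KV-head grouping below is my assumption for a Qwen2.5-32B-class model, not a figure from the article:

```python
q_heads, kv_heads = 40, 8            # assumed GQA grouping
gqa_saving = q_heads / kv_heads      # 5x vs a full multi-head cache
quant_saving = 4.6                   # TurboQuant's reported ratio
combined = gqa_saving * quant_saving # the techniques compose, ~23x
print(combined)
```

This is why cache-compression techniques are complementary rather than competing: a GQA model still benefits fully from quantizing whatever cache remains.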

The custom Metal kernel approach limits portability - this optimization works exclusively on Apple Silicon. CUDA-based systems require different implementations, and CPU-only inference cannot benefit from these GPU-specific optimizations. Developers targeting multiple platforms need fallback paths.

Compression ratios and speed impacts vary by model architecture and context length. The 4.6x compression and 0.98x speed figures represent specific test conditions. Different models may see different results based on their attention patterns and layer configurations.

The technique assumes quantization errors remain within acceptable bounds for the target application. While benchmarks show maintained quality, applications requiring absolute numerical precision might need additional validation before deploying compressed caches in production.