30B Model Handles 10M Tokens via Subquadratic Attention
Someone released a 30B model that handles massive context windows without the usual performance collapse.
The trick is subquadratic attention: instead of checking every token against every other token (O(L^2)), it runs a two-stage search that's O(L^(3/2)). It first scores larger chunks of the context, picks the most relevant ones, then does detailed attention only within those.
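The two-stage idea can be sketched in a few lines. This is a minimal single-query illustration, not the repo's actual implementation: chunk summaries here are just mean-pooled keys, and `chunk`/`top_k` are made-up parameters. With chunk size around sqrt(L), per-query cost drops from O(L) to roughly O(sqrt(L)), giving the O(L^(3/2)) total.

```python
import numpy as np

def two_stage_attention(q, K, V, chunk=64, top_k=4):
    """Illustrative sketch of two-stage attention for one query vector.

    Stage 1 scores coarse chunk summaries; stage 2 runs exact softmax
    attention only inside the top_k selected chunks, so the per-query
    cost is O(L/chunk + top_k*chunk) instead of O(L).
    """
    L, d = K.shape
    n_chunks = L // chunk
    Kc = K[: n_chunks * chunk].reshape(n_chunks, chunk, d)
    Vc = V[: n_chunks * chunk].reshape(n_chunks, chunk, d)

    # Stage 1: cheap coarse scores against mean-pooled chunk summaries
    summaries = Kc.mean(axis=1)            # (n_chunks, d)
    coarse = summaries @ q                 # (n_chunks,)
    picked = np.argsort(coarse)[-top_k:]   # most relevant chunks

    # Stage 2: exact attention over only the selected tokens
    Ks = Kc[picked].reshape(-1, d)         # (top_k * chunk, d)
    Vs = Vc[picked].reshape(-1, d)
    scores = Ks @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ Vs                          # (d,)
```

A real decoder applies this per head and per query token, and the chunk-scoring stage is what keeps the memory and decode numbers below from blowing up.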
Practical numbers on a single B200:
- 1M tokens: 109 tok/s decode, 66GB memory
- 10M tokens: 76 tok/s decode, 120GB memory
When context goes 10x bigger, speed only drops by ~30% instead of becoming unusable.
The repo has an OpenAI-compatible server built in; install and run instructions are at https://github.com/concavity-ai/superlinear
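Since the server speaks the standard OpenAI chat-completions protocol, any OpenAI-compatible client should work against it. A minimal stdlib sketch is below; the port (`8000`) and model name are assumptions, so check the repo's README for the actual values.

```python
import json
import urllib.request

# Request shape accepted by any OpenAI-compatible /v1/chat/completions
# endpoint. The model name here is an assumption, not confirmed by the repo.
payload = {
    "model": "superlinear-exp-v0.1",
    "messages": [{"role": "user", "content": "Summarize this codebase."}],
    "max_tokens": 256,
}

def chat(base_url="http://localhost:8000/v1"):
    # POST the payload and return the assistant's reply text
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the interface is standard, existing tooling (the `openai` Python client, LangChain, etc.) can be pointed at the local server by overriding the base URL.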
Model weights: https://huggingface.co/concavity-ai/superlinear-exp-v0.1
Paper breakdown: https://arxiv.org/abs/2601.18401
Pretty significant for anyone trying to process entire codebases or long documents locally without needing a cluster.