30B Model Handles 10M Tokens via Subquadratic Attention
Someone released a 30B model that handles massive context windows without the usual performance collapse.
The trick is subquadratic attention: instead of checking every token against every other token (O(L^2)), it runs a two-stage search that's O(L^(3/2)). It first scores larger chunks of the context, picks the most relevant ones, then does detailed attention only within those.
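The two-stage idea can be sketched in a few lines. This is a minimal single-query illustration, not the repo's actual implementation: chunk summaries here are just mean-pooled keys, and `chunk`/`top_k` are made-up parameters. With chunk size around sqrt(L), per-query cost drops from O(L) to roughly O(sqrt(L)), giving the O(L^(3/2)) total.

```python
import numpy as np

def two_stage_attention(q, K, V, chunk=64, top_k=4):
    """Illustrative sketch of two-stage attention for one query vector.

    Stage 1 scores coarse chunk summaries; stage 2 runs exact softmax
    attention only inside the top_k selected chunks, so the per-query
    cost is O(L/chunk + top_k*chunk) instead of O(L).
    """
    L, d = K.shape
    n_chunks = L // chunk
    Kc = K[: n_chunks * chunk].reshape(n_chunks, chunk, d)
    Vc = V[: n_chunks * chunk].reshape(n_chunks, chunk, d)

    # Stage 1: cheap coarse scores against mean-pooled chunk summaries
    summaries = Kc.mean(axis=1)            # (n_chunks, d)
    coarse = summaries @ q                 # (n_chunks,)
    picked = np.argsort(coarse)[-top_k:]   # most relevant chunks

    # Stage 2: exact attention over only the selected tokens
    Ks = Kc[picked].reshape(-1, d)         # (top_k * chunk, d)
    Vs = Vc[picked].reshape(-1, d)
    scores = Ks @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ Vs                          # (d,)
```

A real decoder applies this per head and per query token, and the chunk-scoring stage is what keeps the memory and decode numbers below from blowing up.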
Practical numbers on a single B200:
- 1M tokens: 109 tok/s decode, 66GB memory
- 10M tokens: 76 tok/s decode, 120GB memory
When context goes 10x bigger, speed only drops by ~30% instead of becoming unusable.
The repo has an OpenAI-compatible server built in; install and run instructions are at https://github.com/concavity-ai/superlinear
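Since the server speaks the standard OpenAI chat-completions protocol, any OpenAI-compatible client should work against it. A minimal stdlib sketch is below; the port (`8000`) and model name are assumptions, so check the repo's README for the actual values.

```python
import json
import urllib.request

# Request shape accepted by any OpenAI-compatible /v1/chat/completions
# endpoint. The model name here is an assumption, not confirmed by the repo.
payload = {
    "model": "superlinear-exp-v0.1",
    "messages": [{"role": "user", "content": "Summarize this codebase."}],
    "max_tokens": 256,
}

def chat(base_url="http://localhost:8000/v1"):
    # POST the payload and return the assistant's reply text
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the interface is standard, existing tooling (the `openai` Python client, LangChain, etc.) can be pointed at the local server by overriding the base URL.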
Model weights: https://huggingface.co/concavity-ai/superlinear-exp-v0.1
Paper breakdown: https://arxiv.org/abs/2601.18401
Pretty significant for anyone trying to process entire codebases or long documents locally without needing a cluster.