
30B Model Handles 10M Tokens via Subquadratic Attention

A 30-billion parameter language model achieves 10-million-token context processing through a novel subquadratic attention mechanism, dramatically reducing the compute cost of attention over long contexts.

Someone released a 30B model that handles massive context windows without the usual performance collapse.

The trick is subquadratic attention - instead of comparing every token against every other token (O(L^2)), it does a two-stage search that costs O(L^(3/2)). It scores coarse chunks first, picks the most relevant ones, then does detailed attention only within those.
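A minimal sketch of that two-stage idea for a single query, in NumPy. The block pooling (mean over keys) and the fixed top-k selection here are illustrative assumptions, not the repo's actual kernel; the point is the shape of the computation. With block size around sqrt(L) and a fixed number of selected blocks, the per-query cost is O(sqrt(L)), so all L queries together cost O(L^(3/2)):

```python
import numpy as np

def two_stage_attention(q, K, V, block_size=64, top_k=4):
    """Two-stage (subquadratic) attention for one query vector.

    Stage 1: score each block via a pooled summary key, keep top_k blocks.
    Stage 2: exact softmax attention over tokens in those blocks only.
    Per-query cost: O(L/block_size + top_k * block_size) instead of O(L).
    """
    L, d = K.shape
    n_blocks = L // block_size
    # Stage 1: one summary key per block (mean-pooled here; the real
    # model's pooling/selection rule may differ).
    block_keys = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    coarse_scores = block_keys @ q
    chosen = np.argsort(coarse_scores)[-top_k:]
    # Stage 2: gather the tokens of the chosen blocks and attend exactly.
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in chosen]
    )
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]
```

Only `top_k * block_size` tokens ever enter the softmax, which is what keeps both decode speed and KV-cache traffic from blowing up at long contexts.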

Practical numbers on a single B200:

  • 1M tokens: 109 tok/s decode, 66GB memory
  • 10M tokens: 76 tok/s decode, 120GB memory

When context grows 10x, decode speed drops by only ~30% instead of becoming unusable.
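A back-of-the-envelope check on why that holds (operation counts only, constants and hardware effects ignored; this is not a FLOP measurement of the actual model):

```python
# Compare asymptotic attention work at the two context lengths.
# O(L^2) is dense attention; O(L^1.5) is the two-stage scheme.
for L in (1_000_000, 10_000_000):
    quad = L ** 2      # dense: every token vs. every token
    subq = L ** 1.5    # two-stage: ~sqrt(L) work per query
    print(f"L={L:>10,}: L^2={quad:.2e}  L^1.5={subq:.2e}  savings={quad / subq:,.0f}x")
```

The savings factor is sqrt(L): ~1,000x at 1M tokens and ~3,162x at 10M, which is why the 10x context increase costs far less than 10x in practice.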

Install and run:

The repo has an OpenAI-compatible server built in. Full details at https://github.com/concavity-ai/superlinear
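Since the server speaks the standard OpenAI chat-completions protocol, any compatible client should work. A stdlib-only sketch of building such a request; the port (8000) and model name here are placeholders I'm assuming, not values from the repo:

```python
import json
from urllib.request import Request

def chat_request(prompt, base_url="http://localhost:8000/v1",
                 model="superlinear-exp-v0.1"):
    """Build a POST request for an OpenAI-compatible /chat/completions
    endpoint. base_url and model are assumptions; check the repo's docs
    for the real serving address and model id."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Summarize this 10M-token codebase.")
# urllib.request.urlopen(req) would send it once the server is running.
```

The same endpoint shape means existing tooling (OpenAI SDKs, LangChain, etc.) can point at the local server by swapping the base URL.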

Model weights: https://huggingface.co/concavity-ai/superlinear-exp-v0.1

Paper breakdown: https://arxiv.org/abs/2601.18401

Pretty significant for anyone trying to process entire codebases or long documents locally without needing a cluster.