30B Model Processes 10M Tokens with Subquadratic Attention
What It Is
Concavity AI released a 30-billion parameter language model that processes context windows up to 10 million tokens without the performance degradation that typically cripples standard transformer architectures. The model, called Superlinear, implements a two-stage attention mechanism that achieves O(L^(3/2)) complexity instead of the standard O(L^2) quadratic scaling.
Traditional transformer attention compares every token against every other token in the context window. This quadratic relationship means doubling the context length quadruples the computational cost. Superlinear’s approach first scores larger chunks of text to identify the most relevant sections, then performs detailed attention calculations only within those high-scoring regions. This hierarchical search pattern reduces the number of comparisons needed while maintaining the model’s ability to reference distant context.
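The coarse-then-fine pattern described above can be sketched in a few lines. This is a toy illustration only, assuming mean-pooled chunk keys for the coarse stage and a single query vector; Superlinear's actual scoring and chunk-selection rules may differ.

```python
import numpy as np

def two_stage_attention(q, K, V, chunk_size=64, top_k=4):
    """Attend one query over only the top_k highest-scoring chunks.

    Stage 1 scores one mean-pooled key per chunk; stage 2 runs exact
    softmax attention restricted to tokens inside the winning chunks.
    """
    L, d = K.shape
    n_chunks = L // chunk_size
    # Stage 1: coarse relevance score per chunk (n_chunks comparisons).
    chunk_keys = K[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(axis=1)
    coarse = chunk_keys @ q
    best = np.argsort(coarse)[-top_k:]
    # Stage 2: fine-grained attention over top_k * chunk_size tokens only.
    idx = np.concatenate(
        [np.arange(c * chunk_size, (c + 1) * chunk_size) for c in best]
    )
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
L, d = 1024, 32
K, V, q = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=d)
out = two_stage_attention(q, K, V)
print(out.shape)  # (32,)
```

With 1,024 tokens, 64-token chunks, and top_k=4, the query touches 16 chunk summaries plus 256 tokens instead of all 1,024, which is where the subquadratic savings come from.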
The practical impact shows up in benchmark numbers on a single NVIDIA B200 GPU. At a 1-million-token context the model decodes at 109 tokens per second while consuming 66GB of memory. Scaling to 10 million tokens - a 10x increase in context length - only drops throughput to 76 tokens per second and requires 120GB of memory. That roughly 30% speed reduction contrasts sharply with the complete performance collapse that standard attention mechanisms experience at these scales.
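The asymptotic arithmetic behind that contrast is easy to check: when the context grows 10x, quadratic attention does 100x the work, while an O(L^(3/2)) scheme does about 31.6x. (Measured throughput degrades even less than that, since decoding cost also depends on memory bandwidth and other factors.)

```python
# Relative attention cost when context length grows by scale_factor,
# under a given complexity exponent (2 for quadratic, 1.5 for L^(3/2)).
def cost_ratio(scale_factor: float, exponent: float) -> float:
    return scale_factor ** exponent

print(cost_ratio(10, 2))    # quadratic: 100x the work
print(cost_ratio(10, 1.5))  # subquadratic: ~31.6x the work
```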
Why It Matters
This development addresses a fundamental bottleneck in working with large language models locally. Developers analyzing entire codebases, researchers processing long documents, or teams building RAG systems can now handle substantially more context without requiring distributed infrastructure or cloud-based solutions.
The memory efficiency proves particularly valuable. Fitting 10 million tokens in 120GB means teams with high-end workstations or single-GPU servers can process contexts that previously demanded multi-GPU clusters. This democratizes access to long-context capabilities beyond organizations with extensive compute budgets.
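The two reported data points (66GB at 1M tokens, 120GB at 10M) permit a back-of-the-envelope split, if we assume memory grows linearly with context: the slope estimates the per-token cache cost and the intercept the fixed weight-plus-runtime cost. The linearity assumption and the resulting ~6KB-per-token figure are inferences from this article's numbers, not figures from the paper.

```python
# Fit memory = fixed + per_token * context_length to the two reported
# points, assuming linear growth in context (an assumption).
points = [(1_000_000, 66.0), (10_000_000, 120.0)]  # (tokens, GB)
(t1, m1), (t2, m2) = points
per_token_gb = (m2 - m1) / (t2 - t1)   # GB of cache per token of context
fixed_gb = m1 - per_token_gb * t1      # weights + runtime overhead
print(per_token_gb * 1e9)  # ~6000 bytes of cache per token
print(fixed_gb)            # ~60 GB fixed cost
```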
For code analysis workflows, 10 million tokens translates to roughly 30-40 million characters of source code at a typical 3-4 characters per token - enough to load multiple large repositories simultaneously. Legal document review, scientific literature analysis, and other text-heavy applications gain similar advantages. The model can maintain coherence across these extended contexts rather than losing track of earlier information or requiring chunking strategies that fragment semantic relationships.
The subquadratic attention mechanism also opens research directions for further optimization. If O(L^(3/2)) scaling proves viable at this parameter count, similar techniques might extend to even larger models or enable new architectures that balance context length against other capabilities.
Getting Started
The model and inference code are available through standard Python package management.
The repository at https://github.com/concavity-ai/superlinear includes an OpenAI-compatible API server, allowing integration with existing tools and workflows that expect OpenAI’s endpoint format. This compatibility means developers can swap in Superlinear without rewriting client code.
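Because the server mimics OpenAI's endpoint format, any OpenAI-style client should work against it. A minimal sketch using only the standard library, assuming a local server on port 8000 and a model id of superlinear-exp-v0.1 (both the port and the model id are assumptions; check the repository's README for the actual values):

```python
import json
from urllib import request

# Hypothetical endpoint and model id -- confirm against the repo's README.
BASE_URL = "http://localhost:8000/v1"
payload = {
    "model": "superlinear-exp-v0.1",
    "messages": [{"role": "user", "content": "Summarize the attached codebase."}],
    "max_tokens": 256,
}

req = request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer unused"},
)
# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Pointing an existing OpenAI SDK client at the same base URL works the same way, which is what makes the drop-in swap possible.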
Model weights are hosted at https://huggingface.co/concavity-ai/superlinear-exp-v0.1 for direct download or use with Hugging Face’s transformers library. The technical paper detailing the attention mechanism and training methodology is available at https://arxiv.org/abs/2601.18401.
Hardware requirements center on GPU memory rather than compute power. The 120GB footprint at maximum context length calls for a GPU in the NVIDIA B200 class, since a single 80GB H100 cannot hold it, though smaller context windows work on more accessible hardware.
Context
Several alternative approaches tackle long-context processing. Sparse attention patterns like those in Longformer or BigBird reduce complexity through fixed patterns, but sacrifice the dynamic relevance scoring that Superlinear’s two-stage approach provides. State space models such as Mamba avoid attention entirely, trading different architectural constraints for linear scaling.
Ring attention and other distributed attention methods achieve long contexts by splitting computation across multiple GPUs, but require coordination overhead and specialized infrastructure. Superlinear’s single-GPU capability offers simpler deployment at the cost of the 30B parameter limit.
The experimental version designation (v0.1) suggests this remains early-stage technology. Production use cases should account for potential model updates, limited community testing, and the possibility of edge cases in the attention mechanism. The 30B parameter count also means this model won’t match the raw capabilities of 70B+ models on complex reasoning tasks, even if it handles more context.