LLaDA2.1 Achieves 1587 TPS with Token Editing
What It Is
LLaDA2.1 introduces a fundamentally different approach to text generation through Token-to-Token editing. Unlike traditional language models that generate text sequentially and cannot revise what they’ve already produced, this architecture can identify and correct its own mistakes during the generation process itself.
The system operates in two distinct modes. S Mode prioritizes speed by generating aggressively and then applying corrections on the fly. The 100B flash variant achieves 892 tokens per second on HumanEval+ benchmarks using this approach. Q Mode takes a more conservative path, sacrificing some throughput for improved accuracy on standard evaluation tasks.
The smaller 16B mini model demonstrates the architecture’s efficiency by reaching approximately 1587 tokens per second on coding tasks. This performance level from a relatively compact model suggests significant architectural improvements over conventional transformer-based approaches.
Multi-Block Editing represents the core innovation. Rather than treating previously generated tokens as immutable, the model can revisit and revise earlier sections of its output. This capability trades some raw speed for better reasoning quality, particularly on complex tasks requiring internal consistency.
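The idea can be illustrated with a toy sketch. This is our own conceptual illustration, not the actual LLaDA2.1 mechanism or API: output is produced block by block, and after each new block, earlier blocks can be revisited and revised when a consistency check flags them.

```python
# Toy sketch of multi-block editing (hypothetical names; not the LLaDA2.1 API).
# Blocks already emitted remain editable: each time a new block arrives, every
# earlier block is re-checked against the full output and revised if needed.

def multi_block_generate(draft_blocks, check, revise):
    """Emit blocks in order, revising any earlier block that fails `check`."""
    output = []
    for block in draft_blocks:
        output.append(block)
        # Revisit all earlier blocks, not just the newest one.
        for i, earlier in enumerate(output):
            if not check(earlier, output):
                output[i] = revise(earlier, output)
    return output

# Example: repair an inconsistent identifier introduced in an earlier block.
blocks = ["def fib(n):", "    return fibo(n - 1) + fibo(n - 2)"]
check = lambda block, ctx: "fibo" not in block
revise = lambda block, ctx: block.replace("fibo", "fib")
print(multi_block_generate(blocks, check, revise))
# → ['def fib(n):', '    return fib(n - 1) + fib(n - 2)']
```

The inner loop is what distinguishes editing from purely sequential decoding: a mistake in block one can still be fixed while block two is being produced, which is exactly the speed-for-consistency trade described above.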
Why It Matters
Speed improvements of this magnitude change the economics of LLM deployment. Organizations running high-volume inference workloads could see substantial reductions in compute costs and latency. A model generating at 1587 TPS can complete tasks in seconds that might take traditional models minutes.
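The arithmetic is straightforward. Using the article's reported peak figure against a representative 100 TPS sequential model (an illustrative baseline, not a measured one):

```python
# Back-of-envelope latency comparison. 1587 TPS is the article's reported
# peak for the 16B mini; 100 TPS is an illustrative sequential baseline.
# Real throughput depends on hardware, batch size, and sequence length.

def generation_time(num_tokens, tokens_per_second):
    """Seconds to emit num_tokens at a given sustained rate."""
    return num_tokens / tokens_per_second

task = 10_000  # tokens in a long code-generation task
fast = generation_time(task, 1587)  # ≈ 6.3 seconds
slow = generation_time(task, 100)   # 100 seconds
print(f"editing model: {fast:.1f}s, sequential baseline: {slow:.1f}s")
```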
The dual-mode architecture addresses a persistent tension in model deployment. Development teams often face a choice between fast models with acceptable quality or slower models with better accuracy. Having both modes in a single architecture simplifies infrastructure and allows dynamic switching based on task requirements.
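Dynamic switching could be as simple as a routing function in front of the model. The policy below is our own assumption about how a team might wire this up; only the "S"/"Q" mode names come from the source:

```python
# Hypothetical mode dispatcher. The routing thresholds and task categories
# are illustrative assumptions, not part of LLaDA2.1 itself.

def pick_mode(task_type, latency_budget_s):
    """Route latency-sensitive work to S mode, accuracy-sensitive work to Q mode."""
    if latency_budget_s < 1.0:
        return "S"  # speed first: generate aggressively, correct on the fly
    if task_type in {"math", "long-form-reasoning"}:
        return "Q"  # conservative decoding for accuracy-critical tasks
    return "S"

print(pick_mode("chat", 0.5))  # → S
print(pick_mode("math", 5.0))  # → Q
```

Because both modes live in one architecture, this dispatch happens per request rather than per deployed model, which is where the infrastructure simplification comes from.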
Code generation workloads benefit particularly from this approach. Programming tasks frequently require internal consistency: variable names, function signatures, and logic flow must align throughout a file. The ability to revise earlier tokens helps maintain this consistency without requiring multiple generation passes or external verification steps.
Smaller teams and individual developers gain access to performance previously requiring much larger models. The 16B mini variant’s throughput makes real-time applications more feasible without enterprise-scale infrastructure.
Getting Started
The models are available through the Hugging Face collection at https://huggingface.co/collections/inclusionAI/llada21. The collection includes both the 100B flash and 16B mini variants.
Implementation code and examples can be found at https://github.com/inclusionAI/LLaDA2.X. The repository contains inference scripts and configuration options for both S and Q modes.
A basic inference example might look like:
model = LLaDA21.from_pretrained("inclusionAI/llada21-16b-mini")
model.set_mode("S")  # or "Q" for quality mode
output = model.generate(
    prompt="Write a function to calculate fibonacci numbers",
    max_tokens=500,
    enable_multiblock_editing=True
)
Technical details and architecture documentation are available in the paper at https://huggingface.co/papers/2602.08676.
Context
Traditional speculative decoding approaches attempt to improve throughput by generating multiple token candidates in parallel, then verifying them. LLaDA2.1’s Token-to-Token editing differs by allowing retroactive corrections rather than just parallel speculation.
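The distinction can be made concrete with a toy contrast (our own illustration, not either system's actual implementation): speculative decoding keeps only the verified prefix of a draft and discards everything after the first mismatch, whereas token-to-token editing corrects rejected positions in place.

```python
# Toy contrast between speculative acceptance and in-place token editing.
# All function and variable names here are illustrative.

def speculative_accept(draft, verify):
    """Keep the longest verified prefix; everything after a mismatch is discarded."""
    accepted = []
    for i, tok in enumerate(draft):
        if not verify(i, tok):
            break
        accepted.append(tok)
    return accepted

def token_edit(draft, verify, correct):
    """Correct rejected tokens in place; accepted tokens are never discarded."""
    return [tok if verify(i, tok) else correct(i, tok) for i, tok in enumerate(draft)]

draft = ["def", "fibb", "(", "n", ")"]
verify = lambda i, tok: tok != "fibb"
correct = lambda i, tok: "fib"
print(speculative_accept(draft, verify))   # → ['def']  (later tokens thrown away)
print(token_edit(draft, verify, correct))  # → ['def', 'fib', '(', 'n', ')']
```

In the speculative case one bad token forfeits the rest of the draft; the editing case salvages it, which is why retroactive correction can pay off on long, mostly-correct generations.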
Models like GPT-4 Turbo and Claude 3 Opus achieve strong quality but typically generate at 50-150 tokens per second. Smaller fast models like Phi-3 or Gemma reach higher throughput but sacrifice reasoning capability. LLaDA2.1’s dual-mode approach attempts to span both use cases.
The Multi-Block Editing mechanism introduces computational overhead. Tasks requiring extensive revision of earlier tokens will see reduced throughput compared to the peak numbers. Teams should benchmark their specific workloads to determine whether S or Q mode provides better overall performance.
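A workload benchmark along those lines only needs wall-clock timing around the generate call. The harness below is a minimal sketch; the generator it times is a stand-in stub, since the real model object follows the repository's API shown earlier:

```python
# Minimal throughput benchmark sketch. `measure_tps` and `fake_generate`
# are illustrative names; swap in the real model's generate call per mode.
import time

def measure_tps(generate_fn, prompt, max_tokens=500, runs=3):
    """Average tokens-per-second for one workload across several runs."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        output = generate_fn(prompt, max_tokens)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard tiny timings
        rates.append(len(output) / elapsed)
    return sum(rates) / len(rates)

# Stub generator so the sketch runs standalone; returns max_tokens tokens.
def fake_generate(prompt, max_tokens=500):
    return ["tok"] * max_tokens

print(measure_tps(fake_generate, "Write a fibonacci function") > 0)  # → True
```

Running the same harness once per mode on a representative prompt set gives the S-versus-Q comparison the paragraph recommends, using your own workload rather than headline benchmark numbers.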
Model size remains a consideration. While the 16B mini variant is relatively compact, the 100B flash model requires substantial GPU memory. Deployment at scale still demands appropriate infrastructure planning.