LLaDA2.1 Achieves 1587 TPS with Token Editing
What It Is
LLaDA2.1 introduces a fundamentally different approach to text generation through Token-to-Token editing. Unlike traditional language models that generate text sequentially and cannot revise what they’ve already produced, this architecture can identify and correct its own mistakes during the generation process itself.
The system operates in two distinct modes. S Mode prioritizes speed by generating aggressively and then applying corrections on the fly. The 100B flash variant achieves 892 tokens per second on HumanEval+ benchmarks using this approach. Q Mode takes a more conservative path, sacrificing some throughput for improved accuracy on standard evaluation tasks.
The smaller 16B mini model demonstrates the architecture’s efficiency by reaching approximately 1587 tokens per second on coding tasks. This performance level from a relatively compact model suggests significant architectural improvements over conventional transformer-based approaches.
Multi-Block Editing represents the core innovation. Rather than treating previously generated tokens as immutable, the model can revisit and revise earlier sections of its output. This capability trades some raw speed for better reasoning quality, particularly on complex tasks requiring internal consistency.
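The idea can be illustrated with a toy sketch. This is our own conceptual illustration, not the actual LLaDA2.1 mechanism or API: output is produced block by block, and after each new block, earlier blocks can be revisited and revised when a consistency check flags them.

```python
# Toy sketch of multi-block editing (hypothetical names; not the LLaDA2.1 API).
# Blocks already emitted remain editable: each time a new block arrives, every
# earlier block is re-checked against the full output and revised if needed.

def multi_block_generate(draft_blocks, check, revise):
    """Emit blocks in order, revising any earlier block that fails `check`."""
    output = []
    for block in draft_blocks:
        output.append(block)
        # Revisit all earlier blocks, not just the newest one.
        for i, earlier in enumerate(output):
            if not check(earlier, output):
                output[i] = revise(earlier, output)
    return output

# Example: repair an inconsistent identifier introduced in an earlier block.
blocks = ["def fib(n):", "    return fibo(n - 1) + fibo(n - 2)"]
check = lambda block, ctx: "fibo" not in block
revise = lambda block, ctx: block.replace("fibo", "fib")
print(multi_block_generate(blocks, check, revise))
# → ['def fib(n):', '    return fib(n - 1) + fib(n - 2)']
```

The inner loop is what distinguishes editing from purely sequential decoding: a mistake in block one can still be fixed while block two is being produced, which is exactly the speed-for-consistency trade described above.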
Why It Matters
Speed improvements of this magnitude change the economics of LLM deployment. Organizations running high-volume inference workloads could see substantial reductions in compute costs and latency. A model generating at 1587 TPS can complete tasks in seconds that might take traditional models minutes.
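The arithmetic is straightforward. Using the article's reported peak figure against a representative 100 TPS sequential model (an illustrative baseline, not a measured one):

```python
# Back-of-envelope latency comparison. 1587 TPS is the article's reported
# peak for the 16B mini; 100 TPS is an illustrative sequential baseline.
# Real throughput depends on hardware, batch size, and sequence length.

def generation_time(num_tokens, tokens_per_second):
    """Seconds to emit num_tokens at a given sustained rate."""
    return num_tokens / tokens_per_second

task = 10_000  # tokens in a long code-generation task
fast = generation_time(task, 1587)  # ≈ 6.3 seconds
slow = generation_time(task, 100)   # 100 seconds
print(f"editing model: {fast:.1f}s, sequential baseline: {slow:.1f}s")
```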
The dual-mode architecture addresses a persistent tension in model deployment. Development teams often face a choice between fast models with acceptable quality or slower models with better accuracy. Having both modes in a single architecture simplifies infrastructure and allows dynamic switching based on task requirements.
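Dynamic switching could be as simple as a routing function in front of the model. The policy below is our own assumption about how a team might wire this up; only the "S"/"Q" mode names come from the source:

```python
# Hypothetical mode dispatcher. The routing thresholds and task categories
# are illustrative assumptions, not part of LLaDA2.1 itself.

def pick_mode(task_type, latency_budget_s):
    """Route latency-sensitive work to S mode, accuracy-sensitive work to Q mode."""
    if latency_budget_s < 1.0:
        return "S"  # speed first: generate aggressively, correct on the fly
    if task_type in {"math", "long-form-reasoning"}:
        return "Q"  # conservative decoding for accuracy-critical tasks
    return "S"

print(pick_mode("chat", 0.5))  # → S
print(pick_mode("math", 5.0))  # → Q
```

Because both modes live in one architecture, this dispatch happens per request rather than per deployed model, which is where the infrastructure simplification comes from.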
Code generation workloads benefit particularly from this approach. Programming tasks frequently require internal consistency: variable names, function signatures, and logic flow must align throughout a file. The ability to revise earlier tokens helps maintain this consistency without requiring multiple generation passes or external verification steps.
Smaller teams and individual developers gain access to performance previously requiring much larger models. The 16B mini variant’s throughput makes real-time applications more feasible without enterprise-scale infrastructure.
Getting Started
The models are available through the Hugging Face collection at https://huggingface.co/collections/inclusionAI/llada21. The collection includes both the 100B flash and 16B mini variants.
Implementation code and examples can be found at https://github.com/inclusionAI/LLaDA2.X. The repository contains inference scripts and configuration options for both S and Q modes.
A basic inference example might look like:
model = LLaDA21.from_pretrained("inclusionAI/llada21-16b-mini")
model.set_mode("S")  # or "Q" for quality mode
output = model.generate(
    prompt="Write a function to calculate fibonacci numbers",
    max_tokens=500,
    enable_multiblock_editing=True
)
Technical details and architecture documentation are available in the paper at https://huggingface.co/papers/2602.08676.
Context
Traditional speculative decoding approaches attempt to improve throughput by generating multiple token candidates in parallel, then verifying them. LLaDA2.1’s Token-to-Token editing differs by allowing retroactive corrections rather than just parallel speculation.
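The distinction can be made concrete with a toy contrast (our own illustration, not either system's actual implementation): speculative decoding keeps only the verified prefix of a draft and discards everything after the first mismatch, whereas token-to-token editing corrects rejected positions in place.

```python
# Toy contrast between speculative acceptance and in-place token editing.
# All function and variable names here are illustrative.

def speculative_accept(draft, verify):
    """Keep the longest verified prefix; everything after a mismatch is discarded."""
    accepted = []
    for i, tok in enumerate(draft):
        if not verify(i, tok):
            break
        accepted.append(tok)
    return accepted

def token_edit(draft, verify, correct):
    """Correct rejected tokens in place; accepted tokens are never discarded."""
    return [tok if verify(i, tok) else correct(i, tok) for i, tok in enumerate(draft)]

draft = ["def", "fibb", "(", "n", ")"]
verify = lambda i, tok: tok != "fibb"
correct = lambda i, tok: "fib"
print(speculative_accept(draft, verify))   # → ['def']  (later tokens thrown away)
print(token_edit(draft, verify, correct))  # → ['def', 'fib', '(', 'n', ')']
```

In the speculative case one bad token forfeits the rest of the draft; the editing case salvages it, which is why retroactive correction can pay off on long, mostly-correct generations.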
Models like GPT-4 Turbo and Claude 3 Opus achieve strong quality but typically generate at 50-150 tokens per second. Smaller fast models like Phi-3 or Gemma reach higher throughput but sacrifice reasoning capability. LLaDA2.1’s dual-mode approach attempts to span both use cases.
The Multi-Block Editing mechanism introduces computational overhead. Tasks requiring extensive revision of earlier tokens will see reduced throughput compared to the peak numbers. Teams should benchmark their specific workloads to determine whether S or Q mode provides better overall performance.
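A workload benchmark along those lines only needs wall-clock timing around the generate call. The harness below is a minimal sketch; the generator it times is a stand-in stub, since the real model object follows the repository's API shown earlier:

```python
# Minimal throughput benchmark sketch. `measure_tps` and `fake_generate`
# are illustrative names; swap in the real model's generate call per mode.
import time

def measure_tps(generate_fn, prompt, max_tokens=500, runs=3):
    """Average tokens-per-second for one workload across several runs."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        output = generate_fn(prompt, max_tokens)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard tiny timings
        rates.append(len(output) / elapsed)
    return sum(rates) / len(rates)

# Stub generator so the sketch runs standalone; returns max_tokens tokens.
def fake_generate(prompt, max_tokens=500):
    return ["tok"] * max_tokens

print(measure_tps(fake_generate, "Write a fibonacci function") > 0)  # → True
```

Running the same harness once per mode on a representative prompt set gives the S-versus-Q comparison the paragraph recommends, using your own workload rather than headline benchmark numbers.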
Model size remains a consideration. While the 16B mini variant is relatively compact, the 100B flash model requires substantial GPU memory. Deployment at scale still demands appropriate infrastructure planning.