
Scaling Qwen 3.5 to 1M tokens/sec with vLLM

A benchmark shows Qwen 3.5 27B sustaining over 1 million tokens per second across 12 nodes on vLLM v0.18.0, through strategic configuration changes alone.


What It Is

A recent benchmark pushed Qwen 3.5 27B past 1 million tokens per second across 12 nodes running vLLM v0.18.0. The achievement required no custom kernels or exotic modifications - just strategic configuration changes that multiplied per-node throughput by roughly 10x.

Four specific optimizations drove the results. First, the deployment switched from tensor parallelism (TP=8) to data parallelism (DP=8), running complete model copies on each GPU rather than splitting individual layers across them. Second, the context window shrank from 131K tokens to 4K, slashing the memory reserved for key-value caching. Third, FP8 quantization compressed the KV cache itself. Fourth, and most critically, MTP-1 speculative decoding put the GPUs to work - without it, utilization sat at 0% even with the model loaded.
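
The back-of-the-envelope arithmetic behind the second and third changes can be sketched as follows; the layer count, KV-head count, and head dimension here are illustrative assumptions, not Qwen 3.5 27B's published architecture:

```python
# Per-request KV-cache footprint under different context lengths and dtypes.
# Layer count, KV-head count, and head dim are illustrative assumptions.
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x accounts for storing both keys and values per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

full_bf16 = kv_cache_bytes(131_072)               # 131K context, 16-bit cache
small_fp8 = kv_cache_bytes(4_096, dtype_bytes=1)  # 4K context, 8-bit cache

print(f"131K ctx, 16-bit: {full_bf16 / 2**30:.1f} GiB per request")
print(f"4K ctx, fp8:      {small_fp8 / 2**30:.3f} GiB per request")
print(f"reduction: {full_bf16 / small_fp8:.0f}x")
```

Whatever the exact dimensions, the two changes multiply: a 32x shorter window times a 2x smaller dtype leaves 64x more cache headroom per request.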

The combination lifted single-node performance from 9,500 tokens/sec to 95,000 tokens/sec. Scaling to 12 nodes maintained 96% efficiency using basic ClusterIP round-robin load balancing, avoiding the 35% overhead penalty that came with more sophisticated KV-cache-aware routing through an inference gateway.
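
A quick sanity check of those headline numbers, treating cluster throughput as linear scaling discounted by the stated efficiency:

```python
# Sanity-check the figures reported in the benchmark write-up.
per_node = 95_000   # tokens/sec per node after tuning
baseline = 9_500    # tokens/sec per node before tuning
nodes = 12
efficiency = 0.96   # reported multi-node scaling efficiency

speedup = per_node / baseline            # per-node improvement
cluster = per_node * nodes * efficiency  # effective cluster throughput

print(f"per-node speedup: {speedup:.0f}x")
print(f"cluster throughput: {cluster:,.0f} tokens/sec")
```

The numbers are self-consistent: 12 nodes at 96% efficiency yields roughly 1.09 million tokens per second.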

Why It Matters

These results demonstrate that production-scale LLM serving doesn’t always require bleeding-edge infrastructure or custom CUDA kernels. Teams running high-throughput inference workloads can extract dramatic performance gains from configuration tuning alone.

The shift from tensor to data parallelism particularly matters for deployment planning. While TP splits each layer across the GPUs in a node, DP gives each GPU (or GPU group) a complete copy of the model. For models that fit comfortably in GPU memory, DP often wins because it eliminates cross-GPU communication during forward passes. This benchmark confirms that pattern holds even for 27B-parameter models on modern accelerators.
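
A toy cost model makes the tradeoff concrete; every constant below is an assumed round number, not a measurement:

```python
# Toy cost model: TP shrinks per-layer compute but pays an all-reduce per
# layer; DP runs full compute per GPU with no cross-GPU traffic, and a node
# hosts eight independent replicas. All constants are assumed round numbers.
LAYERS = 48
COMPUTE_US = 50    # per-layer compute time on one GPU (microseconds)
ALLREDUCE_US = 30  # per-layer all-reduce cost under tensor parallelism

def step_time_tp(tp_degree):
    return LAYERS * (COMPUTE_US / tp_degree + ALLREDUCE_US)

def step_time_dp():
    return LAYERS * COMPUTE_US

# Node-level throughput in forward passes per second (8 GPUs per node):
tp_node = 1e6 / step_time_tp(8)     # one replica spanning all 8 GPUs
dp_node = 8 * 1e6 / step_time_dp()  # eight single-GPU replicas

print(f"TP=8 node: {tp_node:.0f} passes/sec")
print(f"DP=8 node: {dp_node:.0f} passes/sec")
```

Under these assumptions a TP=8 replica finishes each pass faster, but eight independent DP replicas push several times more total passes through the node - which is exactly the throughput-over-latency trade the benchmark made.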

The speculative decoding finding reveals a critical gap in standard deployment assumptions. GPU utilization hitting 0% without speculation suggests that memory bandwidth, not compute, becomes the bottleneck for autoregressive generation at scale. Speculative decoding drafts multiple tokens in parallel, then verifies them in a single forward pass - converting memory-bound operations into compute-bound ones that actually use the silicon.
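
The draft-then-verify control flow can be sketched with toy stand-ins for the two models; target_next and draft_next below are deterministic placeholder functions, not real networks:

```python
# Draft-then-verify control flow of speculative decoding, with toy
# deterministic functions standing in for the target and draft models.

def target_next(ctx):
    # "expensive" model: one token per forward pass
    return (sum(ctx) + 1) % 97

def draft_next(ctx):
    # "cheap" model: agrees with the target except at every third position
    t = target_next(ctx)
    return t if len(ctx) % 3 else (t + 1) % 97

def speculative_step(ctx, k=1):
    # 1) draft k tokens autoregressively with the cheap model
    drafts, c = [], list(ctx)
    for _ in range(k):
        d = draft_next(c)
        drafts.append(d)
        c.append(d)
    # 2) verify drafts against the target; keep the accepted prefix,
    #    replace the first mismatch with the target's own token
    accepted, c = [], list(ctx)
    for d in drafts:
        t = target_next(c)
        if d != t:
            accepted.append(t)
            break
        accepted.append(d)
        c.append(d)
    else:
        # all drafts accepted: the verify pass yields one bonus token
        accepted.append(target_next(c))
    return ctx + accepted

seq = [1]
for _ in range(8):
    seq = speculative_step(seq, k=1)
print(len(seq), seq)
```

The verified output is identical to what the target model would produce on its own; speculation only changes how many expensive forward passes it takes to get there.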

Organizations serving millions of requests daily can apply these patterns immediately. The 10x throughput improvement translates directly to infrastructure cost reductions or capacity for 10x more traffic on the same hardware budget.

Getting Started

The full configuration details live at https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592, but the core vLLM parameters look like this:

 --tensor-parallel-size 1 \
 --pipeline-parallel-size 1 \
 --max-model-len 4096 \
 --kv-cache-dtype fp8 \
 --speculative-model [draft-model] \
 --num-speculative-tokens 1

Setting --tensor-parallel-size 1 keeps each instance on a single GPU, so launching multiple instances produces data parallelism. Setting --max-model-len 4096 caps the context window well below Qwen’s 131K maximum. The --kv-cache-dtype fp8 flag enables quantized cache storage. Speculative decoding requires a smaller draft model - the benchmark used MTP-1, though other compact models trained for speculation work similarly.
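
Pieced together, a single-node launch of eight data-parallel replicas might look like the sketch below; the model identifier and port range are hypothetical placeholders, not values from the benchmark, and the draft model remains unspecified as in the original flags:

```shell
# Hypothetical single-node launch: eight independent replicas, one per GPU.
# Model name and ports are illustrative placeholders.
for GPU in 0 1 2 3 4 5 6 7; do
  CUDA_VISIBLE_DEVICES=$GPU vllm serve Qwen/Qwen3.5-27B \
    --port $((8000 + GPU)) \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --speculative-model [draft-model] \
    --num-speculative-tokens 1 &
done
```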

For multi-node deployments, standard Kubernetes services with round-robin load balancing proved more effective than specialized routing. The inference gateway overhead suggests simpler is sometimes faster.
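
The routing behavior such a Service provides is nothing more than a rotation; a minimal sketch, with made-up endpoint names:

```python
# ClusterIP-style round-robin: requests rotate through replicas with no
# awareness of which node holds a warm KV cache. Endpoint names are made up.
from itertools import cycle

replicas = [f"vllm-node-{i}:8000" for i in range(12)]
rotation = cycle(replicas)

def route():
    return next(rotation)

targets = [route() for _ in range(24)]
print(targets[:3])
```

Statelessness is the point: with no per-request bookkeeping, the dispatcher adds essentially zero overhead, which is why it beat the cache-aware gateway here.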

Context

This approach trades context length for throughput. Dropping from 131K to 4K tokens eliminates use cases requiring long-context reasoning - document analysis, extended conversations, or large codebases. Teams needing those capabilities must accept lower token rates or explore alternative architectures like ring attention.

Speculative decoding introduces its own complexity. Draft models must generate plausible continuations quickly enough to offset verification overhead. Poor draft quality wastes cycles on rejected tokens. Model-specific tuning determines whether speculation helps or hurts.
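
Under a simplified model - each drafted token independently accepted with probability p, an idealization not taken from the article - the payoff has a closed form:

```python
# Expected tokens emitted per expensive target pass when drafting k tokens,
# each independently accepted with probability p (an idealized assumption).
# The verify pass always contributes one token (a correction or a bonus).
def expected_tokens(p, k):
    # truncated geometric sum: 1 + p + p**2 + ... + p**k
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.3, 0.6, 0.9):
    print(f"acceptance {p}: {expected_tokens(p, 1):.2f} tokens/pass at k=1")
```

At k=1 this reduces to 1 + p: a draft accepted 90% of the time nearly doubles tokens per target pass, while one accepted 30% of the time barely pays for its own overhead.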

Alternative serving frameworks like TensorRT-LLM or SGLang offer different optimization profiles. TensorRT-LLM provides tighter NVIDIA integration with custom kernels, while SGLang focuses on structured generation. The vLLM approach here prioritizes reproducibility and standard tooling over absolute peak performance.

The 96% scaling efficiency across 12 nodes is impressive but not universal. Network topology, request patterns, and model architecture all affect multi-node performance. Smaller models or different parallelism strategies might scale differently.