By Promptsicle Team

Scaling Qwen 3.5 to 1M Tokens/Sec with vLLM

A technical guide to scaling the Qwen 3.5 language model to one million tokens per second with the vLLM inference engine, covering configuration and deployment.

Large language model deployments often hit a wall when serving thousands of concurrent users. Response times balloon, hardware costs spiral, and the infrastructure that worked perfectly during testing crumbles under production load. This bottleneck has plagued organizations trying to move AI applications from prototype to production scale.

Recent work combining Alibaba’s Qwen 3.5 models with vLLM, an optimized inference engine, demonstrates how to break through this barrier. Teams report aggregate throughput exceeding one million tokens per second across multi-GPU deployments, expanding what is practical with open-source language models.

Background on the Performance Stack

vLLM emerged from UC Berkeley’s research as a high-throughput serving system specifically designed for large language models. The engine implements PagedAttention, an algorithm that manages attention key-value caches more efficiently than traditional approaches. Instead of pre-allocating large contiguous memory blocks that often go partially unused, PagedAttention stores these caches in non-contiguous memory pages, similar to how operating systems handle virtual memory.
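The paging idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class `PagedKVAllocator`, its methods, and the block size of 16 tokens are all hypothetical names and values chosen for illustration. It tracks only which physical blocks each sequence owns, not the cached tensors themselves.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block-table allocator illustrating paged KV caching (not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # sequence id -> physical block ids
        self.seq_lens: Dict[str, int] = {}            # sequence id -> tokens cached so far

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # sequence is new, or its last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(16):
    allocator.append_token("a")    # fills one block for sequence "a"
for _ in range(5):
    allocator.append_token("b")    # sequence "b" takes the next block
for _ in range(4):
    allocator.append_token("a")    # "a" grows into a block that is not adjacent to its first
print(allocator.block_tables)
allocator.free("a")
print(len(allocator.free_blocks))  # 7: only "b" still holds a block
```

Because blocks are claimed one at a time as a sequence grows, the memory wasted per sequence is at most one partially filled block, rather than the unused tail of a large pre-allocated buffer.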

Qwen 3.5, released by Alibaba Cloud, is the latest iteration of their multilingual model family. The variants discussed here range from 0.5B to 72B parameters, the same range as the Qwen2.5 lineup used in the example configuration in this guide, with particularly strong performance on reasoning tasks and code generation. These models use a standard transformer architecture but incorporate training optimizations that improve both quality and inference efficiency.

The two fit together well. Qwen 3.5’s architecture aligns with vLLM’s optimization strategies, while vLLM’s batching mechanisms keep GPUs at the high utilization that large Qwen models need when serving at scale.

Key Implementation Details

Achieving million-token-per-second throughput requires careful configuration across multiple dimensions. The setup typically involves multiple GPU instances running vLLM servers, with specific attention to batch sizing and memory management.
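A back-of-envelope calculation helps size such a fleet. The sketch below assumes identical replicas whose throughput adds up linearly; the function name `replicas_needed` and the per-replica figure are hypothetical placeholders, not measurements, and should be replaced with numbers from your own benchmarks.

```python
import math


def replicas_needed(target_tokens_per_sec: float, per_replica_tokens_per_sec: float) -> int:
    """Number of identical vLLM replicas needed to reach an aggregate throughput target."""
    if per_replica_tokens_per_sec <= 0:
        raise ValueError("per-replica throughput must be positive")
    return math.ceil(target_tokens_per_sec / per_replica_tokens_per_sec)


# Hypothetical inputs, for illustration only.
per_replica = 12_000    # tokens/sec from one replica (assumed, not measured)
gpus_per_replica = 4    # assumed: one tensor-parallel group of four GPUs

n = replicas_needed(1_000_000, per_replica)
print(n, "replicas,", n * gpus_per_replica, "GPUs")
```

Linear scaling is the optimistic case. Load-balancer imbalance and uneven request lengths usually mean a real fleet needs some headroom on top of this count.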

A typical deployment configuration looks like this:

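from vllm import LLM, SamplingParams

# The model ID below is a Qwen2.5 checkpoint; substitute the Qwen
# checkpoint you actually deploy.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,         # shard the model weights across 4 GPUs
    gpu_memory_utilization=0.95,    # fraction of each GPU's memory vLLM may use
    max_num_batched_tokens=65536,   # token budget per scheduling step
    max_num_seqs=256,               # maximum sequences running concurrently
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,                 # cap on generated tokens per request
)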

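The tensor_parallel_size parameter shards the model across multiple GPUs, which is essential for the larger Qwen variants that do not fit on a single device. Setting gpu_memory_utilization to 0.95 lets vLLM use up to 95% of each GPU's memory, so more space is left for the KV cache once the weights are loaded. max_num_batched_tokens caps how many tokens vLLM schedules in a single forward pass, and max_num_seqs caps how many sequences run concurrently.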
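To see why the memory fraction matters, here is a rough estimate of how many tokens of KV cache fit on a four-GPU replica. Every number is an assumption to be checked against your own setup: the layer count, KV head count, and head dimension are taken as typical for a 72B-class model with grouped-query attention (verify them in the model's config.json), the weight and overhead sizes are approximate, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_mem_gib: float, num_gpus: int, utilization: float,
                      weight_gib: float, overhead_gib: float,
                      bytes_per_token: int) -> int:
    """Rough number of tokens whose KV cache fits after weights and overhead."""
    budget_gib = gpu_mem_gib * num_gpus * utilization - weight_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // bytes_per_token)


# Assumed model shape (check config.json) with 16-bit cache entries.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Assumed hardware: four 80 GiB GPUs, ~136 GiB of weights, ~20 GiB of other overhead.
capacity = kv_token_capacity(gpu_mem_gib=80, num_gpus=4, utilization=0.95,
                             weight_gib=136, overhead_gib=20,
                             bytes_per_token=per_token)

print(per_token // 1024, "KiB per token")
print(capacity, "tokens of KV cache")
```

Under these assumptions each token costs 320 KiB of cache and a little under half a million tokens fit, or roughly 1,900 tokens per sequence if all 256 sequence slots are in use. Lowering the memory fraction shrinks that budget directly, which in turn limits how many sequences can be batched together.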

Continuous batching is the critical innovation here. Unlike static batching, where the system waits for a full batch and then runs it until the longest request finishes, vLLM schedules at the level of individual generation steps: finished requests leave the batch immediately and new requests join as soon as slots become available. This dramatically improves GPU utilization, particularly when handling requests of varying lengths.
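The difference can be made concrete with a small simulation. This is a simplified sketch under stated assumptions, not vLLM's scheduler: it ignores prefill, treats every decode step as costing the same regardless of batch contents, and assumes each running request produces one token per step. The function names and the request lengths are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when freed slots are refilled from the queue before every step."""
    queue = deque(lengths)
    active: List[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # Every running request emits one token; requests on their last token leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


# Output lengths (in tokens) for eight requests: two long, six short.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]

static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 200 105
```

In the static case the short requests finish after five steps but their slots sit idle until the long request completes. With continuous batching the second long request starts at step six, so the two long requests overlap almost entirely.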

Community Reactions and Validation

Developers deploying these configurations have reported substantial improvements over alternative serving frameworks. Benchmarks shared on GitHub report vLLM with Qwen 3.5 achieving 3-5x higher throughput than Hugging Face’s Text Generation Inference (TGI) on identical hardware.

The open-source community has contributed numerous optimization guides and deployment templates. The vLLM project repository at https://github.com/vllm-project/vllm is a natural starting point, with documentation and example configurations that can be adapted to various Qwen model sizes and hardware setups.

Some practitioners note that achieving peak throughput requires balancing multiple factors. Extremely large batch sizes can increase latency for individual requests, even as overall throughput climbs. Production deployments often target 80-90% of theoretical maximum throughput to maintain acceptable response times.
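A toy cost model shows why the last few percent of throughput are expensive in latency. It assumes each decode step pays a fixed cost plus a cost per sequence in the batch; the constants `FIXED_MS` and `PER_SEQ_MS` are hypothetical and would need to be fitted to measurements from a real deployment.

```python
FIXED_MS = 10.0     # assumed fixed cost of one decode step (weight reads, kernel launches)
PER_SEQ_MS = 0.25   # assumed extra cost per sequence in the batch


def step_time_ms(batch_size: int) -> float:
    """Time for one decode step, which is also the per-token latency each request sees."""
    return FIXED_MS + PER_SEQ_MS * batch_size


def throughput_tokens_per_sec(batch_size: int) -> float:
    """One token per sequence per step, so throughput is batch size over step time."""
    return batch_size / (step_time_ms(batch_size) / 1000.0)


# Throughput limit of this model as the batch size grows without bound.
peak = 1000.0 / PER_SEQ_MS

for batch_size in (16, 64, 256, 1024):
    tput = throughput_tokens_per_sec(batch_size)
    print(f"batch={batch_size:5d}  "
          f"per-token latency={step_time_ms(batch_size):6.1f} ms  "
          f"throughput={tput:7.0f} tok/s  ({tput / peak:.0%} of peak)")
```

In this model, moving from a batch of 256 to 1,024 gains only about ten percentage points of peak throughput while per-token latency more than triples, which is the kind of trade that leads operators to stop short of the theoretical maximum.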

Broader Implications for AI Infrastructure

This level of performance fundamentally changes the economics of running large language models. Organizations can serve the same user base with fewer GPUs, or scale to dramatically larger audiences without proportional infrastructure expansion.

The implications extend beyond cost savings. Lower latency and higher throughput enable new application patterns that weren’t previously viable. Real-time content generation, high-frequency API calls, and interactive multi-turn conversations all become more practical at scale.

For the open-source AI ecosystem, these results narrow the gap between proprietary and open models. When deployment efficiency improves this dramatically, organizations can run sophisticated open models with total costs approaching or beating API-based solutions. This shifts the build-versus-buy calculation for many AI applications, potentially accelerating adoption of locally hosted models where data privacy and control matter most.
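The build-versus-buy comparison comes down to simple arithmetic once throughput is known. The sketch below covers GPU rental cost only; the hourly price and the throughput figure are hypothetical placeholders, and a real comparison would also include engineering time, idle capacity, and networking.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_sec: float) -> float:
    """GPU cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical inputs, for illustration only.
self_hosted = cost_per_million_tokens(gpu_hourly_usd=2.00, num_gpus=4,
                                      tokens_per_sec=12_000)
print(f"${self_hosted:.3f} per million tokens")
```

Because cost is inversely proportional to sustained throughput, doubling tokens per second on the same hardware halves the figure, which is why serving efficiency moves the comparison against per-token API pricing so directly.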