Qwen 3.5 40B Models Trained on Claude Reasoning

Alibaba’s Qwen team achieved a 91.2% score on the GPQA Diamond benchmark with their latest 40B parameter models, matching performance levels previously seen only in significantly larger systems. The Qwen 3.5 40B series introduces two specialized variants trained on reasoning traces from Claude models, marking an unusual cross-company collaboration in the open-source AI landscape.

The release includes Qwen-3.5-40B-Instruct-Reasoning and Qwen-3.5-40B-Instruct-Reasoning-Turbo, both fine-tuned on synthetic reasoning data generated by Anthropic’s Claude 3.7 Sonnet. This approach leverages Claude’s chain-of-thought capabilities to teach the Qwen models more structured problem-solving patterns without requiring the computational overhead of training from scratch.

Benchmarks

The reasoning-enhanced models demonstrate substantial improvements over the base Qwen 3.5 40B across multiple evaluation frameworks. On AIME 2024, a challenging mathematics competition benchmark, the reasoning variant scored 23.3% compared to 16.7% for the standard model. The GPQA Diamond results show similar gains, with the reasoning model reaching 91.2% versus 86.8% for the base version.
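To put the two quoted comparisons on the same footing, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The scores come from the paragraph above; the 30-problem size of the AIME 2024 set is an assumption used only to translate percentages into problem counts.

```python
# Scores quoted in the article, in percent: (base model, reasoning variant)
scores = {
    "AIME 2024": (16.7, 23.3),
    "GPQA Diamond": (86.8, 91.2),
}


def gains(base: float, tuned: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = tuned - base
    relative = absolute / base * 100
    return round(absolute, 1), round(relative, 1)


for name, (base, tuned) in scores.items():
    absolute, relative = gains(base, tuned)
    print(f"{name}: +{absolute} points ({relative}% relative)")

# Assumption: the AIME 2024 set has 30 problems, so one problem is worth ~3.3 points.
AIME_PROBLEMS = 30
base_solved = round(16.7 / 100 * AIME_PROBLEMS)
tuned_solved = round(23.3 / 100 * AIME_PROBLEMS)
print(f"AIME 2024: {base_solved} -> {tuned_solved} problems solved")
```

Under that assumption, the AIME gain corresponds to two additional problems, so it is worth reading alongside the GPQA Diamond result rather than on its own.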

Mathematical reasoning tasks reveal the most dramatic improvements. The models achieve 89.4% on GSM8K and 78.2% on MATH-500, outperforming several 70B parameter competitors. Code generation benchmarks show more modest gains, with LiveCodeBench scores of 42.1% and HumanEval results at 88.4%.

The turbo variant trades some accuracy for speed, completing reasoning tasks approximately 40% faster while maintaining competitive scores. On most benchmarks, the turbo model performs within 2-3 percentage points of the full reasoning version, making it suitable for applications where latency matters more than marginal accuracy improvements.

How to Run It

Both models are available through Hugging Face and can be deployed using the Transformers library. Loading the unquantized weights at 16-bit precision requires approximately 80GB of VRAM:

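from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-3.5-40B-Instruct-Reasoning"

# Load the weights in the checkpoint's native dtype and spread them across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Solve: If 3x + 7 = 22, what is x?"}
]
# add_generation_prompt=True appends the assistant turn marker so the model starts its answer
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))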

For systems with limited VRAM, quantized versions reduce memory requirements to approximately 20GB using 4-bit precision. For production deployments, the models also work with standard inference optimizations such as FlashAttention-2 and with serving engines such as vLLM.
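The 80GB and 20GB figures follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate for the weights alone; it ignores the KV cache, activations, and quantization overhead, so real usage will be somewhat higher. The function name is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 40e9  # 40B parameters, taken from the model name

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.0f} GB")
```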

Ollama users can access the models through https://ollama.com/library/qwen3.5-reasoning, simplifying local deployment without manual configuration. The turbo variant is recommended for real-time applications where sub-second response times are critical.
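Once a model has been pulled, Ollama serves it over a local REST API, so it can be queried from Python without extra dependencies. The sketch below assumes a default Ollama install listening on port 11434 and assumes the model tag is `qwen3.5-reasoning`, inferred from the library URL above; check the library page for the actual tag names, including the turbo variant.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen3.5-reasoning"  # assumption: tag inferred from the library URL


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the Ollama server running and the model pulled:
#   print(ask("Solve: If 3x + 7 = 22, what is x?"))
```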

Limitations

The reasoning models inherit Claude’s verbose output patterns, generating significantly longer responses than simple queries need. This verbosity increases inference cost and latency, which is particularly problematic for applications that process high request volumes. The models sometimes produce unnecessary intermediate steps even when a direct answer would suffice.
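Because output cost scales linearly with the number of generated tokens, the effect of verbosity is easy to estimate. The sketch below uses purely hypothetical numbers (request volume, response lengths, and price per million tokens are all placeholders, not measurements of these models) to show the shape of the calculation.

```python
def daily_output_cost(requests_per_day: int,
                      tokens_per_response: int,
                      usd_per_million_tokens: float) -> float:
    """Daily cost of generated tokens, in USD."""
    return requests_per_day * tokens_per_response * usd_per_million_tokens / 1e6


# Hypothetical workload: every value below is a placeholder.
REQUESTS_PER_DAY = 100_000
PRICE = 1.0  # USD per million output tokens

concise = daily_output_cost(REQUESTS_PER_DAY, 150, PRICE)
verbose = daily_output_cost(REQUESTS_PER_DAY, 600, PRICE)
print(f"concise: ${concise:.2f}/day, verbose: ${verbose:.2f}/day")
```

Plugging in measured token counts for a real workload gives a quick way to judge whether capping `max_new_tokens` or routing simple queries to the base model is worthwhile.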

Training on synthetic data introduces potential biases from Claude’s reasoning style. The models occasionally mirror Claude-specific quirks, including particular phrasings and explanation structures that may not represent optimal reasoning paths. This dependency on a single source model limits diversity in problem-solving approaches.

Performance gains concentrate heavily in mathematical and logical reasoning tasks. Natural language understanding, creative writing, and open-ended conversation show minimal improvement over the base model. The specialized training makes these variants less suitable as general-purpose assistants compared to the standard Qwen 3.5 40B release.

The 40B parameter count requires substantial hardware for deployment. While smaller than 70B alternatives, the models remain inaccessible for edge devices or consumer hardware without aggressive quantization that degrades reasoning capabilities.

Verdict

The Qwen 3.5 40B Reasoning models deliver measurable improvements in mathematical and logical reasoning tasks, offering a middle ground between compact 7B models and resource-intensive 70B systems. The Claude-derived training data proves effective for teaching structured problem-solving, though the approach works best for specific use cases rather than general applications.

Organizations focusing on mathematical computation, scientific reasoning, or educational tools will likely find that the performance gains justify the additional inference costs. The turbo variant provides a practical option for latency-sensitive deployments where consistent, fast responses matter more than the last few points of accuracy.
