Vibe vs Claude Code: Near-Identical SWE-Bench Results
Vibe and Claude Code achieve nearly identical performance on the SWE-Bench coding benchmark, demonstrating comparable capabilities in solving real-world software engineering tasks.
Someone ran a pretty thorough comparison between Mistral’s Vibe (Devstral 2) and Claude Code on SWE-bench-verified-mini: 900 total runs across 45 real GitHub issues.
The results were surprisingly close:
- Claude Code (Opus via auto-selection): 39.8%
- Vibe (Devstral 2): 37.6%
Basically within the margin of error. The wild part is that Vibe matched Anthropic’s best model while being open-weight and faster (296s vs 357s average).
The real discovery, though: ~40% of test cases gave inconsistent results across runs. Same agent, same bug, different outcome each time. Even on bugs solved 10/10 times, patch sizes varied by up to 8x.
Full breakdown with charts at: https://blog.kvit.app/posts/variance-claude-vibe/
Turns out AI coding agents have way more randomness than most people realize, which makes single-run benchmarks pretty questionable.
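To see why that matters, here is a minimal simulation (hypothetical per-task success probabilities, not the actual benchmark data) of scoring one agent repeatedly on a 45-task benchmark where a "flaky" band of tasks only passes sometimes:

```python
import random
import statistics

random.seed(0)

# Hypothetical per-task success probabilities for 45 tasks:
# some are solved almost always, some almost never, and a
# flaky middle band is solved only about half the time.
probs = [0.95] * 15 + [0.05] * 12 + [0.5] * 18  # 45 tasks total

def single_run_score(probs):
    """Resolution rate from one pass over all tasks."""
    return sum(random.random() < p for p in probs) / len(probs)

# Score the same simulated agent 200 times on the same benchmark.
scores = [single_run_score(probs) for _ in range(200)]

print(f"mean score: {statistics.mean(scores):.3f}")
print(f"stdev:      {statistics.stdev(scores):.3f}")
print(f"min..max:   {min(scores):.3f} .. {max(scores):.3f}")
```

Under these made-up numbers, single-run scores spread several percentage points around the mean, which is wider than the 2.2-point gap between the two agents above. That's the sense in which a single-run leaderboard result can be mostly noise.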