
Vibe vs Claude Code: Near-Identical SWE-Bench Results

Vibe and Claude Code achieve nearly identical performance on the SWE-Bench coding benchmark, demonstrating comparable capabilities in solving real-world software engineering tasks.

Someone ran a pretty thorough comparison between Mistral's Vibe (Devstral 2) and Claude Code on SWE-bench-verified-mini: 900 total runs across 45 real GitHub issues (10 runs per agent per issue).

The results were surprisingly close:

  • Claude Code (Opus via auto-selection): 39.8%
  • Vibe (Devstral 2): 37.6%

Basically within the margin of error. The wild part is Vibe matched Anthropic's best model while being open-weight and faster (296s vs 357s average runtime).
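A rough back-of-the-envelope check supports the margin-of-error reading. This sketch assumes 45 issues x 10 runs = 450 runs per agent (consistent with the 900 total runs reported) and treats runs as independent, which they aren't, so if anything it understates the real uncertainty:

```python
# Two-proportion pooled z-test on the 39.8% vs 37.6% resolve-rate gap.
# Assumption (not from the post): 450 runs per agent, independent trials.
from math import sqrt

n = 450
p1, p2 = 0.398, 0.376                    # Claude Code vs Vibe resolve rates
pooled = (p1 * n + p2 * n) / (2 * n)     # pooled success proportion
se = sqrt(pooled * (1 - pooled) * (2 / n))
z = (p1 - p2) / se
print(f"z = {z:.2f}")                    # well below 1.96, i.e. the gap is
                                         # not significant at the 95% level
```

Even under these generous independence assumptions, the z-score comes out around 0.7, far short of the ~1.96 needed to call the difference real.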

The real discovery, though: ~40% of test cases gave inconsistent results across runs. Same agent, same bug, different outcome each time. Even on bugs solved 10/10 times, patch sizes varied by up to 8x.
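Measuring that kind of flakiness is straightforward once you have repeated runs. Here's a minimal sketch (with made-up run records, not the post's data) that flags an agent-issue pair as inconsistent whenever the same agent both solves and fails the same issue across runs:

```python
# Hypothetical run log: (agent, issue_id, resolved) tuples. With 10 runs
# per agent per issue, an issue is "inconsistent" for an agent if both
# outcomes appear in its run set.
from collections import defaultdict

runs = [
    ("vibe", "issue-a", True),
    ("vibe", "issue-a", False),   # same agent, same bug, different outcome
    ("vibe", "issue-b", True),
    ("vibe", "issue-b", True),
]

outcomes = defaultdict(set)
for agent, issue, resolved in runs:
    outcomes[(agent, issue)].add(resolved)

inconsistent = [pair for pair, seen in outcomes.items() if len(seen) == 2]
rate = len(inconsistent) / len(outcomes)
print(f"{len(inconsistent)}/{len(outcomes)} pairs inconsistent ({rate:.0%})")
```

On this toy log it reports 1 of 2 pairs as inconsistent; the post's finding is that roughly 40% of real test cases land in that bucket.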

Full breakdown with charts at: https://blog.kvit.app/posts/variance-claude-vibe/

Turns out AI coding agents have way more randomness than most people realize, which makes single-run benchmarks pretty questionable.