SWE-rebench: Real-World Coding Benchmark for LLMs

What It Is

SWE-rebench is a benchmark that evaluates language models on authentic software engineering tasks pulled from real repositories. Instead of testing models on isolated coding puzzles or algorithmic challenges, it presents problems that mirror actual development work - fixing bugs in production codebases, implementing feature requests, and handling the messy reality of existing software projects.

The benchmark maintains a public leaderboard at https://swe-rebench.com/ that tracks how different models perform on these tasks. Recent additions include MiniMax M2.1, with GLM-4.7 and Gemini Flash 3 scheduled for upcoming evaluation rounds.

What sets this benchmark apart is the release of 67,074 agentic trajectories - detailed records showing how models navigate through coding problems. These trajectories capture the step-by-step reasoning process, including false starts, debugging attempts, and iterative refinements that models make while working toward solutions. Alongside the trajectories, the release includes two Qwen-based checkpoints trained on this data, available for developers to experiment with.
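As a rough illustration of what such a record contains (field names here are hypothetical, not the released dataset's actual schema), one step of a trajectory might be modeled as a thought, an action, and the resulting observation:

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    # Hypothetical structure; the real dataset's schema may differ.
    thought: str       # model's reasoning before acting
    action: str        # e.g. an edit, a shell command, a search
    observation: str   # environment feedback the model sees next

@dataclass
class Trajectory:
    task_id: str
    steps: list = field(default_factory=list)
    resolved: bool = False

traj = Trajectory(task_id="demo-1")
traj.steps.append(TrajectoryStep(
    thought="The failing test points at parse_config; inspect it first.",
    action="open src/config.py",
    observation="def parse_config(path): ...",
))
print(len(traj.steps), traj.resolved)  # prints "1 False"
```

Chaining many such steps, ending in a resolved or unresolved state, is what makes the records useful for studying where agents go wrong.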

Why It Matters

Traditional coding benchmarks often test narrow skills - can a model write a function to reverse a string, or implement quicksort? These tests miss critical aspects of real software engineering: understanding existing codebases, navigating file structures, identifying root causes of bugs, and making changes that don’t break other functionality.

SWE-rebench addresses this gap by measuring capabilities that matter for practical AI-assisted development. Teams evaluating models for code generation tools can see which ones handle realistic scenarios rather than just textbook problems. The benchmark reveals whether a model can work with legacy code, interpret vague bug reports, or implement features that require touching multiple files.

The trajectory data provides unprecedented insight into model behavior. Researchers can analyze common failure patterns, identify where models get stuck, and understand the reasoning strategies that lead to successful solutions. This transparency helps the community build better coding assistants by learning from both successes and failures across thousands of attempts.

For developers building agentic systems, these trajectories serve as training data and reference implementations. Seeing how models decompose complex tasks, when they choose to search documentation versus trying solutions, and how they recover from errors offers practical guidance for designing coding agents.

Getting Started

The benchmark leaderboard is accessible at https://swe-rebench.com/ where developers can compare model performance across different tasks and difficulty levels.

To explore the trajectory dataset, visit the Reddit discussion at https://www.reddit.com/r/LocalLLaMA/comments/1puxedb/we_release_67074_qwen3coder_openhands/ which contains links to the full dataset and the Qwen-based checkpoints.

For teams wanting to evaluate their own models, the benchmark provides a standardized testing framework. Models receive a problem description and repository context, then generate solutions that get validated against test suites from the original projects. A typical evaluation might look like:

# Model receives: bug report, repo structure, relevant files
# Expected output: code changes that fix the issue
# Validation: existing test suite must pass
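The loop above can be sketched in miniature. This is a toy harness with made-up helper names, not the benchmark's real tooling: the "repository" is a dict of file contents, the "patch" is a replacement file, and the "test suite" is a single check that the buggy function returns the right answer.

```python
# Toy sketch of the evaluation loop (hypothetical names; the real
# harness works on repository checkouts and runs the project's tests).

def apply_patch(repo: dict, patch: dict) -> dict:
    """Return a copy of the repo files with the model's edits applied."""
    fixed = dict(repo)
    fixed.update(patch)
    return fixed

def run_tests(repo: dict) -> bool:
    """Stand-in for the project's real test suite."""
    namespace = {}
    exec(repo["calc.py"], namespace)    # load the patched module
    return namespace["add"](2, 3) == 5  # the behavior under test

# Buggy repository state described by the "bug report"
repo = {"calc.py": "def add(a, b):\n    return a - b\n"}

# A model-proposed fix (in practice, a diff produced by the agent)
patch = {"calc.py": "def add(a, b):\n    return a + b\n"}

print("before:", run_tests(repo))                     # False: bug reproduces
print("after:", run_tests(apply_patch(repo, patch)))  # True: tests pass
```

The real framework adds the hard parts this sketch omits: sandboxed execution, dependency installation, and running the project's full test suite rather than one assertion.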

Researchers can analyze the trajectory data to understand model decision-making patterns, extract successful strategies, or train new models on demonstrated problem-solving approaches.
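A first-pass analysis of that kind can be quite simple. The sketch below assumes trajectories arrive as JSON Lines with per-task step counts and an outcome flag; these field names are illustrative, not the dataset's documented schema.

```python
import json

# Hypothetical JSONL trajectory summaries; the real schema may differ.
raw = """
{"task": "fix-issue-1", "steps": 14, "resolved": true}
{"task": "fix-issue-2", "steps": 42, "resolved": false}
{"task": "fix-issue-3", "steps": 9, "resolved": true}
"""

trajectories = [json.loads(line) for line in raw.strip().splitlines()]

resolved = [t for t in trajectories if t["resolved"]]
rate = len(resolved) / len(trajectories)
avg_steps = sum(t["steps"] for t in resolved) / len(resolved)

print(f"resolve rate: {rate:.0%}, avg steps when resolved: {avg_steps:.1f}")
```

Even coarse aggregates like these can surface failure patterns, for example whether unresolved tasks cluster around unusually long trajectories.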

Context

SWE-rebench complements rather than replaces existing coding benchmarks. HumanEval and MBPP test fundamental coding ability through standalone functions, and completion-style benchmarks extend similar tests across languages. SWE-bench, the predecessor, introduced repository-level tasks but exposed far less of the agent's step-by-step behavior.

The main limitation is evaluation cost - running models through complex repository tasks requires significant compute compared to simple function generation. This slows iteration and limits how frequently the leaderboard updates.

The benchmark also reflects biases in the underlying repositories and issue types. Performance on open-source Python projects may not predict success with enterprise Java codebases or embedded systems code. The tasks emphasize bug fixes and features over architecture decisions or code review.

Still, for teams deploying AI coding assistants or researchers advancing code generation capabilities, SWE-rebench offers the most realistic evaluation framework currently available. The trajectory data transforms it from just another leaderboard into a learning resource that reveals how models actually approach software engineering problems.