coding

SWE-rebench: Real-World Coding Benchmark for LLMs

SWE-rebench is a real-world coding benchmark that evaluates large language models on their ability to solve authentic software engineering tasks drawn from real open-source repositories.

Here's a useful benchmark for testing how well LLMs actually solve real coding tasks instead of just answering questions.

SWE-rebench tracks model performance on actual software engineering problems - things like fixing bugs in real repos and implementing features. The leaderboard just added MiniMax M2.1 results, with GLM-4.7 and Gemini Flash 3 coming next.

Check it out: https://swe-rebench.com/

What makes this interesting is that they also released 67,074 agentic trajectories showing how models work through coding problems step by step, plus two Qwen-based checkpoints trained on that data.

Full dataset details: https://www.reddit.com/r/LocalLLaMA/comments/1puxedb/we_release_67074_qwen3coder_openhands/

Pretty useful if you want to see how different models handle actual development work versus the usual coding benchmarks. The trajectory data exposes the reasoning process, not just the final answers.
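To get a feel for what a step-by-step trajectory looks like, here's a minimal sketch of walking one record. The schema below (fields like `instance_id`, `steps`, `resolved`) is a hypothetical illustration, not the actual SWE-rebench format — check the dataset card for the real structure.

```python
import json

# Hypothetical trajectory record: an assumed schema of alternating agent
# actions and environment observations (the real dataset format may differ).
record = json.loads("""
{
  "instance_id": "example-repo__issue-123",
  "steps": [
    {"role": "assistant", "action": "run",
     "content": "pytest tests/ -x"},
    {"role": "user", "action": "observation",
     "content": "1 failed, 41 passed"},
    {"role": "assistant", "action": "edit",
     "content": "fix off-by-one in parser loop bound"}
  ],
  "resolved": true
}
""")

# Walk the trajectory to see the reasoning process, not just the final patch.
for i, step in enumerate(record["steps"], start=1):
    print(f"{i}. [{step['role']}/{step['action']}] {step['content']}")

print("resolved:", record["resolved"])
```

Iterating over the intermediate steps like this is what separates trajectory data from plain benchmark scores: you can see which commands a model ran and what feedback it reacted to before landing on a fix.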