SWE-rebench: Real-World Coding Benchmark for LLMs
SWE-rebench is a real-world coding benchmark that evaluates large language models on their ability to solve authentic software engineering tasks from real open-source repositories.
It's a useful benchmark for testing how well LLMs actually solve real coding tasks instead of just answering questions.
SWE-rebench tracks model performance on actual software engineering problems - things like fixing bugs in real repos and implementing features. The leaderboard just added MiniMax M2.1 results, with GLM-4.7 and Gemini Flash 3 coming next.
Check it out: https://swe-rebench.com/
What makes this interesting is that they also released 67,074 agentic trajectories showing how models work through coding problems step by step, along with two Qwen-based checkpoints trained on that data.
Full dataset details: https://www.reddit.com/r/LocalLLaMA/comments/1puxedb/we_release_67074_qwen3coder_openhands/
Pretty useful if you want to see how different models handle actual development work versus the usual coding benchmarks. The trajectory data shows the reasoning process, not just final answers.
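To get a feel for what step-by-step trajectory data looks like, here's a minimal sketch that walks a single trajectory record. The schema below (fields like `steps`, `action`, `observation`) is a made-up illustration, not the actual SWE-rebench format — check the dataset release for the real field names.

```python
import json

# A hypothetical trajectory record; the real SWE-rebench schema may differ.
record = json.loads("""
{
  "task_id": "example-repo__issue-123",
  "steps": [
    {"action": "open_file", "args": "src/utils.py", "observation": "def parse(...): ..."},
    {"action": "edit", "args": "fix off-by-one in parse", "observation": "edit applied"},
    {"action": "run_tests", "args": "pytest tests/", "observation": "3 passed"}
  ],
  "resolved": true
}
""")

def summarize(rec):
    # Walk the trajectory to see the action sequence, not just the final outcome.
    lines = [f"task: {rec['task_id']} (resolved={rec['resolved']})"]
    for i, step in enumerate(rec["steps"], 1):
        lines.append(f"  {i}. {step['action']}: {step['args']} -> {step['observation']}")
    return "\n".join(lines)

print(summarize(record))
```

The point of this kind of data is exactly what the summary above surfaces: the sequence of tool calls and observations an agent made on the way to a fix, which is what the released checkpoints were trained on.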
Related Tips
KaniTTS2: Fast Local Text-to-Speech with Cloning
KaniTTS2 provides a fast, locally-run text-to-speech system with voice cloning capabilities, enabling users to generate natural-sounding speech from text.
AdaLLM: True FP4 Inference on RTX 4090s Without FP16 Fallback
AdaLLM enables genuine 4-bit floating-point inference on RTX 4090 GPUs without reverting to 16-bit precision, delivering faster and more memory-efficient large-model inference.
Chatbot Framework Rebuilt in Rust: 10MB Binary
A chatbot framework originally written in another language has been completely rewritten in Rust, resulting in a remarkably compact 10MB binary.