LM Arena: Blind Model Comparisons with Elo Rankings

LM Arena at lmarena.ai runs blind head-to-head model comparisons with Elo ratings, helping developers pick models based on actual performance rather than marketing.

The appeal: you evaluate models on your own prompts instead of relying on cherry-picked benchmark numbers. On https://lmarena.ai you submit a prompt, two anonymous models answer side by side, you vote for the better response, and the rankings update from thousands of such votes.

How to use it:

  1. Go to https://lmarena.ai/leaderboard
  2. Filter by category: coding, creative writing, reasoning, etc.
  3. Check the confidence intervals - some rankings are tighter than others (a sketch for comparing two entries follows this list)
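
If you want to compare two entries programmatically, the snippet below is a minimal sketch that assumes you have saved the leaderboard as a local CSV. The file name and the column names ("model", "rating", "ci_lower", "ci_upper") are placeholders for illustration, not LM Arena's actual export format.

# Hypothetical sketch: treat two models as tied when their rating
# confidence intervals overlap. Column names are assumed, not the
# real leaderboard schema.
import pandas as pd

df = pd.read_csv("leaderboard_export.csv")  # assumed local export

def intervals_overlap(a, b):
    return a["ci_lower"] <= b["ci_upper"] and b["ci_lower"] <= a["ci_upper"]

a = df[df["model"] == "model-a"].iloc[0]  # placeholder model names
b = df[df["model"] == "model-b"].iloc[0]
if intervals_overlap(a, b):
    print("Ratings overlap - treat these two models as roughly tied.")
else:
    print("The gap exceeds the confidence intervals - the ranking difference is meaningful.")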

To test a highly ranked model locally, download the weights from the Hugging Face Hub:

huggingface-cli download Qwen/Qwen2.5-72B-Instruct
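
Once the weights are downloaded, one straightforward way to query the model is through the transformers library. This is a sketch rather than a full setup: a 72B model needs multiple GPUs or aggressive quantization, so swap in a smaller checkpoint if you just want to try the workflow.

# Minimal sketch: load a downloaded checkpoint and run one chat turn.
# device_map="auto" spreads the weights across available GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain Elo ratings in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))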

The Elo system means rankings reflect actual user preferences rather than synthetic benchmarks. Useful for cutting through marketing claims when picking between similar models.
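
For intuition on how pairwise votes turn into a ranking, here is a toy Elo update for a single vote. It is illustrative only: the actual leaderboard is computed statistically over the full vote set, not one sequential update at a time.

# Toy Elo update for one head-to-head vote (illustrative, not LM Arena's pipeline).
def elo_update(rating_a, rating_b, winner, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))  # expected score for A
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]          # actual outcome for A
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Two models start equal; A wins the vote and gains exactly what B loses.
print(elo_update(1000, 1000, "a"))  # (1016.0, 984.0)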