How Artificial Analysis Compares AI Models

Choosing between AI models has become harder as the number of providers grows and each one reports performance using its own methodology. Artificial Analysis, available at https://artificialanalysis.ai/, positions itself as an independent source of analysis meant to help people understand the AI landscape and pick the best model and provider for a given use case.

What The Platform Measures

The centerpiece of the site is an Intelligence Index. The current version, v4.1, combines nine separate evaluations into a single score: GDPval-AA v2, 𝜏³-Banking, Terminal-Bench v2.1, SciCode, Humanity’s Last Exam, GPQA Diamond, CritPt, AA-Omniscience, and AA-LCR.

These evaluations target different capabilities rather than a single notion of “smartness.” GDPval-AA v2 covers agentic real-world work tasks, while 𝜏³-Banking measures agentic tool use. Terminal-Bench v2.1 looks at agentic coding and terminal use, and SciCode focuses on coding. GPQA Diamond tests scientific reasoning, CritPt covers physics reasoning, Humanity’s Last Exam measures reasoning and knowledge, AA-Omniscience covers knowledge and hallucination, and AA-LCR examines long-context reasoning. The site also lists additional evaluations that are not among the nine, such as IFBench for instruction following, MMMU-Pro for visual reasoning, and AA-Briefcase, described as a frontier agentic evaluation for long-horizon knowledge work.

Beyond intelligence, the platform compares models on output speed, measured in output tokens per second, and on cost. Pricing data covers input tokens, output tokens, and cache hits, and the site reports a Cost per Intelligence Index Task figure that ties spending back to measured capability. Coding is broken out separately through a Coding Index and a Coding Agent Index.
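
To make the three pricing dimensions concrete, here is a minimal sketch of how a single request's cost could be estimated from input, output, and cache-hit prices. The function name, the per-million-token pricing unit, and the assumption that cached tokens are a subset of input tokens billed at a lower rate are all illustrative; they are not taken from the site, and billing rules vary by provider.

```python
def request_cost_usd(input_tokens, output_tokens, cached_tokens,
                     price_in, price_out, price_cached):
    """Estimate the cost of one request in USD.

    Prices are assumed to be quoted per one million tokens.
    cached_tokens is assumed to be the part of input_tokens served
    from cache and billed at price_cached instead of price_in.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (uncached_tokens * price_in
             + cached_tokens * price_cached
             + output_tokens * price_out)
    return total / 1_000_000


# Hypothetical prices, for illustration only.
cost = request_cost_usd(
    input_tokens=1_000_000, output_tokens=500_000, cached_tokens=200_000,
    price_in=3.0, price_out=15.0, price_cached=0.3,
)
print(f"Estimated cost: ${cost:.2f}")
```

Summing an estimate like this over a fixed set of benchmark tasks is one plausible way to relate spending to capability, though the site's exact method for its Cost per Intelligence Index Task figure is not described here.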

Where The Numbers Come From

Artificial Analysis states that its figures represent performance of a model’s first-party API, or the median across providers when a first-party API is not available. For API provider comparisons, it uses a median (P50) measurement taken over the past 72 hours, which smooths out short-term fluctuations in latency and throughput.
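
The two rules described above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's implementation: the function names, the `(timestamp, value)` sample format, and the provider dictionary are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def p50_over_window(samples, now, window_hours=72):
    """Median (P50) of the measurements taken in the last window_hours.

    samples is an iterable of (timestamp, value) pairs, for example
    output tokens per second measured at different times.
    """
    cutoff = now - timedelta(hours=window_hours)
    recent = [value for ts, value in samples if cutoff <= ts <= now]
    if not recent:
        raise ValueError("no samples inside the window")
    return median(recent)


def model_figure(provider_p50, first_party=None):
    """Use the first-party figure if present, else the median across providers."""
    if first_party is not None and first_party in provider_p50:
        return provider_p50[first_party]
    return median(provider_p50.values())


now = datetime(2025, 1, 4)
samples = [
    (now - timedelta(hours=1), 100.0),
    (now - timedelta(hours=10), 120.0),
    (now - timedelta(hours=71), 500.0),   # a one-off spike, still in the window
    (now - timedelta(hours=80), 9999.0),  # too old, ignored
]
print(p50_over_window(samples, now))
```

Because a median is used rather than a mean, the single spike in the example barely moves the reported figure, which is the smoothing effect described above.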

This sourcing approach matters because the same model can behave differently depending on which provider serves it. The site separately benchmarks API providers against each other, comparing performance for a single model across many hosts, and it includes hardware and GPU benchmarking for inference.

Additional Views And Access

The platform offers more than text-model leaderboards. It maintains Image, Video, and Speech arenas that rank outputs using Elo scores derived from comparisons, alongside an Openness Index that assesses how available and transparent a model is. A Data Playground lets people build custom visualizations from the underlying figures.
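
For readers unfamiliar with Elo, the sketch below shows the classic rating update applied to one head-to-head comparison. The source only says the arenas use Elo scores derived from comparisons; the K-factor of 32, the 400-point scale, and the function names are assumptions borrowed from the standard chess formulation, not details confirmed by the site.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    a_won is True if output A was preferred, False if B was preferred.
    k controls how far a single result moves the ratings (assumed value).
    """
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; A's output is preferred in one comparison.
print(update_elo(1000.0, 1000.0, a_won=True))
```

A win against a much higher-rated opponent moves the ratings more than a win against an equal one, which is why rankings built this way can settle after many comparisons even when the pairings are uneven.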

The core leaderboard content appears to be publicly viewable. A login option exists, and premium plans add expanded benchmark data, custom visualizations, and industry reports.

Reading Benchmarks With Care

A composite score like the Intelligence Index is useful as a starting point, but the breakdown into individual evaluations is where most of the practical value sits. A team building a coding assistant has reason to weight Terminal-Bench and the Coding Index more heavily than physics reasoning, while a team handling long documents may care more about AA-LCR. Pairing those capability scores with the site's speed and cost data gives a fuller picture than any single headline number, and providing that combined view is what an independent comparison site like this sets out to do.
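
One way to act on this is to re-weight the individual evaluation scores for a specific use case. The sketch below is a minimal illustration: the scores, the weights, and the `weighted_score` helper are all hypothetical, and it assumes the scores being combined are on a comparable scale.

```python
def weighted_score(scores, weights):
    """Weighted average of evaluation scores.

    scores maps evaluation name to score; weights maps evaluation name
    to a non-negative weight. Evaluations with no weight are ignored.
    """
    used = {name: w for name, w in weights.items() if w > 0}
    missing = [name for name in used if name not in scores]
    if missing:
        raise KeyError(f"no score for: {missing}")
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(scores[name] * w for name, w in used.items()) / total_weight


# Made-up scores for one model, for illustration only.
scores = {"Terminal-Bench": 40.0, "AA-LCR": 60.0, "CritPt": 10.0}

# A coding-assistant team weights terminal work heavily and ignores physics.
coding_weights = {"Terminal-Bench": 3, "AA-LCR": 1}
print(weighted_score(scores, coding_weights))
```

Running the same scores through a different weight set, such as one that favors AA-LCR for document-heavy work, can change which model comes out ahead even though the underlying numbers are unchanged.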
