Live Rankings · Updated July 15, 2025

AI Model Benchmark Rankings

Real scores from standardized tests - MMLU, HumanEval, MATH-500, GPQA, and more. Rankings update when new models launch. Click any column header to sort.

Auto-updated on new releases
15 models tracked
7 benchmarks
Last updated: July 15, 2025
Score scale: Excellent (90%+) · Strong (78-89%) · Good (60-77%) · Below 60%
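
The color coding follows fixed thresholds. Here is a minimal sketch of that banding logic in Python; the function name score_band is ours, for illustration only:

```python
def score_band(score: float) -> str:
    """Map a benchmark percentage to the band used in the legend above."""
    if score >= 90:
        return "Excellent"
    elif score >= 78:
        return "Strong"
    elif score >= 60:
        return "Good"
    return "Below 60%"

assert score_band(96.4) == "Excellent"   # e.g. o1 on MATH-500
assert score_band(53.6) == "Below 60%"   # e.g. GPT-4o on GPQA
```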
Qwen 2.5 72B 🥇
Alibaba · Open Source · Composite 88.1 (#1)

Alibaba's 72B open-weights model with exceptional math performance relative to size. Strong multilingual capabilities.

MMLU 86.7% · HumanEval 86.9% · MATH-500 83.1% · GSM8K 95.7%
Strengths: Math reasoning · Multilingual · Open weights

o1 🥈 (New)
OpenAI · Reasoning · Composite 87.8 (#2)

OpenAI's flagship reasoning model that spends more time thinking before responding, excelling at complex math, coding, and science.

MMLU 92.3% · HumanEval 92.4% · MATH-500 96.4% · GPQA 78.0% · Arena Elo 1340
Strengths: Scientific reasoning · Competition math · Complex coding

DeepSeek R1 🥉 (New)
DeepSeek · Reasoning · Composite 87.6 (#3)

Open-source reasoning model that matches o1 on many benchmarks. Uses chain-of-thought with reinforcement learning. Weights publicly available.

MMLU 90.8% · HumanEval 92.6% · MATH-500 97.3% · GPQA 71.5% · Arena Elo 1358
Strengths: Competition math · Open weights · Reasoning chains

Claude 3.7 Sonnet (New)
Anthropic · Reasoning · Composite 86.0 (#4)

Anthropic's hybrid reasoning model with extended thinking mode for complex tasks. Sets a new bar on coding and scientific reasoning.

MMLU 88.3% · HumanEval 93.7% · MATH-500 96.2% · GPQA 84.8% · Arena Elo 1301
Strengths: Extended thinking · Scientific QA · Software engineering

Mistral Large 2
Mistral · Frontier · Composite 85.8 (#5)

Mistral's flagship model with top-tier coding skills and multilingual fluency. Available via API and self-hosted.

MMLU 84.0% · HumanEval 92.0% · MATH-500 74.2% · GSM8K 93.0%
Strengths: Code generation · Multilingual · Self-hostable

Gemini 2.0 Flash (New)
Google · Efficient · Composite 82.0 (#6)

Google's fast multimodal model with strong vision capabilities and native tool use. Optimized for speed and cost.

MATH-500 89.7% · MMMU 71.7% · Arena Elo 1354
Strengths: Speed · Multimodal · Tool use

DeepSeek V3 (New)
DeepSeek · Open Source · Composite 80.0 (#7)

Mixture-of-experts frontier model with open weights that rivals GPT-4o at a fraction of the inference cost.

MMLU 88.5% · MATH-500 90.2% · GPQA 59.1% · GSM8K 89.3% · Arena Elo 1318
Strengths: Open weights · Math reasoning · Cost efficiency

Llama 3.1 405B
Meta · Open Source · Composite 79.9 (#8)

Meta's largest open-weights model, competitive with leading frontier models and available for commercial use.

MMLU 88.6% · HumanEval 89.0% · MATH-500 73.8% · GPQA 51.1% · GSM8K 96.8%
Strengths: Open weights · Commercial license · Fine-tunable

Llama 3.3 70B (New)
Meta · Open Source · Composite 79.4 (#9)

Meta's updated 70B model matching 405B performance at a fraction of the compute. Best value in the open-source space.

MMLU 86.0% · HumanEval 88.4% · MATH-500 77.0% · GPQA 50.5% · GSM8K 95.1%
Strengths: Efficiency · Open weights · Instruction tuning

GPT-4o
OpenAI · Frontier · Composite 75.9 (#10)

OpenAI's flagship omni model combining vision, audio, and text. Fast, capable, and deeply integrated with the OpenAI ecosystem.

MMLU 87.2% · HumanEval 90.2% · MATH-500 76.6% · GPQA 53.6% · GSM8K 92.9% · MMMU 69.1% · Arena Elo 1285
Strengths: Multimodal · Speed · API ecosystem

Claude 3.5 Sonnet
Anthropic · Frontier · Composite 75.7 (#11)

Anthropic's top frontier model balancing intelligence and speed. Strongest on coding tasks among non-reasoning models.

MMLU 88.3% · HumanEval 93.7% · MATH-500 78.3% · GPQA 65.0% · MMMU 68.3% · Arena Elo 1282
Strengths: Code generation · Instruction following · Writing

Llama 3.1 70B
Meta · Open Source · Composite 75.3 (#12)

Meta's capable 70B open-weights model. Widely used as a deployment-friendly open-source option.

MMLU 86.0% · HumanEval 80.5% · MATH-500 68.0% · GPQA 46.7% · GSM8K 95.1%
Strengths: Open weights · Deployable · Well-documented

Gemini 1.5 Pro
Google · Frontier · Composite 72.8 (#13)

Google's workhorse multimodal model with a massive 2M-token context window. Excellent for long-document analysis.

MMLU 85.9% · HumanEval 84.1% · MATH-500 67.7% · GPQA 46.2% · GSM8K 90.8% · MMMU 62.2%
Strengths: 2M token context · Multimodal · Long documents

Claude 3 Opus
Anthropic · Frontier · Composite 72.8 (#14)

Anthropic's original flagship model. Excellent for complex reasoning and nuanced analysis tasks.

MMLU 86.8% · HumanEval 84.9% · MATH-500 60.1% · GPQA 50.4% · GSM8K 95.0% · MMMU 59.4%
Strengths: Nuanced reasoning · Long-form writing · Analysis

GPT-4o mini
OpenAI · Efficient · Composite 71.7 (#15)

OpenAI's fast and affordable model for high-volume tasks. Punches well above its weight class for the price.

MMLU 82.0% · HumanEval 87.2% · MATH-500 70.2% · GPQA 40.2% · GSM8K 91.3% · MMMU 59.4%
Strengths: Cost efficiency · Speed · High volume

What Do These Benchmarks Measure?

MMLU

Massive Multitask Language Understanding - tests general knowledge across 57 subjects including math, science, history, and law.

HumanEval

Python coding benchmark - measures ability to complete function implementations from docstrings. Pass@1 metric.
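
The pass@1 numbers above come from the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the HumanEval paper:
    n = samples generated per problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0  # every draw of k samples contains a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the plain per-sample pass rate:
print(pass_at_k(n=200, c=130, k=1))  # 0.65
```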

MATH-500

Competition-level math problems: a 500-problem subset of the MATH benchmark. Tests algebraic reasoning, geometry, calculus, and number theory up to olympiad difficulty.

GPQA

Graduate-level Google-Proof Q&A - science questions so hard that PhD experts score around 65%. Tests true expert reasoning.

GSM8K

Grade School Math 8K - elementary and middle school word problems. Good baseline for everyday math reasoning.

MMMU

Massive Multidiscipline Multimodal Understanding - tests vision + text reasoning across 30 subjects. Only for multimodal models.

Arena Elo

LMSYS Chatbot Arena Elo score - based on millions of real user head-to-head comparisons. Reflects practical user preference.
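
For intuition, here is the classic online Elo update the rating scale is named for; the ratings and K-factor below are illustrative, and in practice LMSYS fits ratings statistically over the full vote history (a Bradley-Terry model) rather than updating vote by vote:

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """One head-to-head vote: move both ratings toward the observed result.
    Expected score follows the logistic curve with a 400-point scale."""
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected)
    return winner + delta, loser - delta

# An upset: a 1285-rated model beats a 1358-rated one and gains ~19 points;
# evenly matched models would shift only ~16 at k=32.
print(elo_update(1285.0, 1358.0))
```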

Composite Score is a weighted average of all available benchmark results, normalized to a 0-100 scale. Models that report fewer benchmarks have composites built from less evidence, which can over- or under-state their actual capability.
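
The exact weights behind the composite are not published here, so the following is only a sketch of the general recipe: a weighted average over whichever 0-100 benchmarks a model reports, with missing benchmarks dropping out (Arena Elo would first need its own normalization to 0-100, omitted here):

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually reports.
    Assumes all inputs are already on a 0-100 scale and at least one
    weighted benchmark is present."""
    reported = [b for b in weights if b in scores]
    total = sum(weights[b] for b in reported)
    return sum(scores[b] * weights[b] for b in reported) / total

# Hypothetical equal weights -- NOT the weighting this page uses.
weights = {"MMLU": 1, "HumanEval": 1, "MATH-500": 1, "GPQA": 1, "GSM8K": 1}
qwen = {"MMLU": 86.7, "HumanEval": 86.9, "MATH-500": 83.1, "GSM8K": 95.7}
print(round(composite_score(qwen, weights), 1))  # 88.1 with these inputs
```

With these equal weights the sketch happens to land on Qwen 2.5 72B's listed 88.1, but it does not reproduce other rows (o1's four percentage benchmarks average to 89.8), so treat the weighting above as purely illustrative.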

Scores are sourced from official model cards, technical reports, and third-party evaluations such as the LMSYS Chatbot Arena. Different evaluation setups (few-shot vs. zero-shot, temperature, prompting style) can produce different results. This page is updated when significant new models are released. Last updated: July 15, 2025.

Frequently Asked Questions

Which AI model scores highest overall?

As of our last update, Qwen 2.5 72B holds the #1 composite score, with OpenAI o1 and DeepSeek R1 close behind on the strength of their math and reasoning results. Claude 3.7 Sonnet leads on GPQA Diamond (graduate-level science), and DeepSeek R1 posts the top MATH-500 score. Rankings change frequently as new models are released.

How often are these rankings updated?

We update the rankings whenever a significant new model is announced and official benchmark scores are published by the model provider. The 'Last updated' date at the top shows the most recent refresh.

What is the MMLU benchmark?

MMLU (Massive Multitask Language Understanding) tests a model across 57 subjects including elementary math, US history, computer science, law, and more. It is one of the most widely cited benchmarks for general AI knowledge.

What does GPQA measure?

GPQA Diamond contains PhD-level science questions so difficult that human experts (without Google access) score around 65%. It is used to test whether models have true expert-level reasoning, not just memorized facts.

Why do some models have missing benchmark scores?

Not all providers publish results for every benchmark. Some efficient or specialized models may skip benchmarks that are not relevant to their use case. We show a dash (-) when data is not publicly available.

What is the Chatbot Arena Elo score?

The LMSYS Chatbot Arena Elo score is based on millions of real human head-to-head comparisons where users vote for the better response without knowing which model produced it. It reflects real-world user preference rather than a predefined test, making it a strong indicator of practical quality.

Is a higher Composite Score always better?

Not necessarily. The composite is an average across benchmarks. A model that excels at coding but has fewer math results might score lower than a well-rounded model. Always look at the specific benchmarks that matter for your use case.

What is the difference between frontier and reasoning models?

Frontier models are the most capable general-purpose models from major AI labs. Reasoning models (like o1, DeepSeek R1, Claude 3.7 Sonnet) use extended chain-of-thought processing to spend more time 'thinking' before answering, giving them an edge on complex math and science tasks at the cost of higher latency.