AI Model Benchmark Rankings
Scores from standardized benchmarks: MMLU, HumanEval, MATH-500, GPQA Diamond, and more. Rankings are updated when new models launch.
| Rank | Model | Developer | Category | Composite | MMLU | HumanEval | MATH-500 | GPQA Diamond | GSM8K | MMMU | Arena Elo |
|------|-------|-----------|----------|-----------|------|-----------|----------|--------------|-------|------|-----------|
| 1 | Qwen 2.5 72B | Alibaba | Open Source | 88.1 | 86.7% | 86.9% | 83.1% | - | 95.7% 🥈 | - | - |
| 2 | o1 | OpenAI | Reasoning | 87.8 | 92.3% 🥇 | 92.4% | 96.4% 🥈 | 78.0% 🥈 | - | - | 1340 🥉 |
| 3 | DeepSeek R1 | DeepSeek | Reasoning | 87.6 | 90.8% 🥈 | 92.6% 🥉 | 97.3% 🥇 | 71.5% 🥉 | - | - | 1358 🥇 |
| 4 | Claude 3.7 Sonnet | Anthropic | Reasoning | 86.0 | 88.3% | 93.7% 🥇 | 96.2% 🥉 | 84.8% 🥇 | - | - | 1301 |
| 5 | Mistral Large 2 | Mistral | Frontier | 85.8 | 84.0% | 92.0% | 74.2% | - | 93.0% | - | - |
| 6 | Gemini 2.0 Flash | Google | Efficient, Vision | 82.0 | - | - | 89.7% | - | - | 71.7% 🥇 | 1354 🥈 |
| 7 | DeepSeek V3 | DeepSeek | Open Source | 80.0 | 88.5% | - | 90.2% | 59.1% | 89.3% | - | 1318 |
| 8 | Llama 3.1 405B | Meta | Open Source | 79.9 | 88.6% 🥉 | 89.0% | 73.8% | 51.1% | 96.8% 🥇 | - | - |
| 9 | Llama 3.3 70B | Meta | Open Source | 79.4 | 86.0% | 88.4% | 77.0% | 50.5% | 95.1% 🥉 | - | - |
| 10 | GPT-4o | OpenAI | Frontier, Vision | 75.9 | 87.2% | 90.2% | 76.6% | 53.6% | 92.9% | 69.1% 🥈 | 1285 |
| 11 | Claude 3.5 Sonnet | Anthropic | Frontier | 75.7 | 88.3% | 93.7% 🥈 | 78.3% | 65.0% | - | 68.3% 🥉 | 1282 |
| 12 | Llama 3.1 70B | Meta | Open Source | 75.3 | 86.0% | 80.5% | 68.0% | 46.7% | 95.1% | - | - |
| 13 | Gemini 1.5 Pro | Google | Frontier, Vision | 72.8 | 85.9% | 84.1% | 67.7% | 46.2% | 90.8% | 62.2% | - |
| 14 | Claude 3 Opus | Anthropic | Frontier | 72.8 | 86.8% | 84.9% | 60.1% | 50.4% | 95.0% | 59.4% | - |
| 15 | GPT-4o mini | OpenAI | Efficient, Vision | 71.7 | 82.0% | 87.2% | 70.2% | 40.2% | 91.3% | 59.4% | - |
Qwen 2.5 72B: Alibaba's 72B open-weights model with exceptional math performance relative to its size. Strong multilingual capabilities.
o1: OpenAI's flagship reasoning model that spends more time thinking before responding, excelling at complex math, coding, and science.
DeepSeek R1: Open-source reasoning model that matches o1 on many benchmarks. Uses chain-of-thought with reinforcement learning. Weights publicly available.
Claude 3.7 Sonnet: Anthropic's hybrid reasoning model with an extended thinking mode for complex tasks. Sets a new bar on coding and scientific reasoning.
Mistral Large 2: Mistral's flagship model with top-tier coding skills and multilingual fluency. Available via API and for self-hosting.
Gemini 2.0 Flash: Google's fast multimodal model with strong vision capabilities and native tool use. Optimized for speed and cost.
DeepSeek V3: Mixture-of-experts frontier model with open weights that rivals GPT-4o at a fraction of the inference cost.
Llama 3.1 405B: Meta's largest open-weights model, competitive with leading frontier models and available for commercial use.
Llama 3.3 70B: Meta's updated 70B model, nearly matching 405B performance at a fraction of the compute. Best value in the open-source space.
GPT-4o: OpenAI's flagship omni model combining vision, audio, and text. Fast, capable, and deeply integrated with the OpenAI ecosystem.
Claude 3.5 Sonnet: Anthropic's frontier model balancing intelligence and speed. Strongest on coding tasks among non-reasoning models.
Llama 3.1 70B: Meta's capable 70B open-weights model. Widely used as a deployment-friendly open-source option.
Gemini 1.5 Pro: Google's workhorse multimodal model with a massive 2M-token context window. Excellent for long-document analysis.
Claude 3 Opus: Anthropic's former flagship model. Excellent for complex reasoning and nuanced analysis tasks.
GPT-4o mini: OpenAI's fast and affordable model for high-volume tasks. Punches well above its weight class for the price.
What Do These Benchmarks Measure?
MMLU: Massive Multitask Language Understanding. Tests general knowledge across 57 subjects, including math, science, history, and law.
HumanEval: Python coding benchmark that measures the ability to complete function implementations from docstrings, scored with the pass@1 metric (a sketch of how pass@1 is estimated follows this list).
MATH-500: A 500-problem subset of the MATH dataset. Competition-level problems covering algebra, geometry, precalculus, and number theory at high-school competition difficulty.
GPQA Diamond: Graduate-level Google-Proof Q&A. Science questions so hard that PhD-level domain experts score around 65%. Tests genuine expert reasoning.
GSM8K: Grade School Math 8K. Elementary and middle school word problems; a good baseline for everyday math reasoning.
MMMU: Massive Multi-discipline Multimodal Understanding. Tests combined vision and text reasoning across 30 subjects; reported only for multimodal models.
Arena Elo: LMSYS Chatbot Arena Elo score, based on millions of real user head-to-head comparisons. Reflects practical user preference.
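To make the pass@1 metric concrete, here is a minimal Python sketch of the unbiased pass@k estimator that was introduced alongside HumanEval. The sample counts in the example are made up for illustration, and real evaluation harnesses differ in how they sample completions and run the unit tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass the unit tests, k = attempts the metric allows."""
    if n - c < k:
        return 1.0  # every possible k-subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples generated for one problem, 130 pass the tests.
print(round(pass_at_k(n=200, c=130, k=1), 3))  # 0.65 -> pass@1 of 65% on this problem
```

For k=1 this reduces to the fraction of generated samples that pass, averaged over all problems in the benchmark.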
The Composite Score is a weighted average of all available benchmark results, normalized to a 0-100 scale. Models with fewer published benchmark results may rank higher or lower than their actual capability would suggest.
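As a rough sketch of how a composite like this can be computed, the snippet below takes a weighted average over whatever benchmarks a model reports. The weights and benchmark keys are hypothetical, not the exact formula behind this table, and it assumes each score is already on a 0-100 scale.

```python
# Hypothetical weights -- illustrative only, not the exact ones behind this table.
WEIGHTS = {"mmlu": 1.0, "humaneval": 1.0, "math500": 1.0,
           "gpqa": 1.0, "gsm8k": 0.5, "mmmu": 0.5}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of the benchmark scores a model actually reports
    (each already on a 0-100 scale); unreported benchmarks are skipped."""
    available = {b: s for b, s in scores.items() if b in WEIGHTS}
    total_weight = sum(WEIGHTS[b] for b in available)
    return sum(WEIGHTS[b] * s for b, s in available.items()) / total_weight

# Example: a model that only reports MMLU, MATH-500, and GSM8K.
print(round(composite_score({"mmlu": 86.7, "math500": 83.1, "gsm8k": 95.7}), 1))  # 87.1
```

This also illustrates the caveat above: a model that reports only its strongest benchmarks can end up with a flattering average.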
Scores are sourced from official model cards, technical reports, and third-party evaluations like the LMSYS Chatbot Arena. Different evaluation setups (few-shot vs. zero-shot, temperature, prompting style) can produce different results.
This page is updated when significant new models are released. Last updated: July 15, 2025.
Frequently Asked Questions
Which AI model scores highest overall?
As of our last update, Qwen 2.5 72B holds the top composite score, with OpenAI o1 and DeepSeek R1 close behind; the two reasoning models lead most of the math and reasoning benchmarks. Claude 3.7 Sonnet leads on GPQA Diamond (graduate-level science). Rankings change frequently as new models are released.
How often are these rankings updated?
We update the rankings whenever a significant new model is announced and official benchmark scores are published by the model provider. The 'Last updated' date at the top shows the most recent refresh.
What is the MMLU benchmark?
MMLU (Massive Multitask Language Understanding) tests a model across 57 subjects including elementary math, US history, computer science, law, and more. It is one of the most widely cited benchmarks for general AI knowledge.
What does GPQA measure?
GPQA Diamond contains PhD-level science questions so difficult that domain experts score only around 65%, while skilled non-experts do far worse even with unrestricted web access (hence "Google-proof"). It is used to test whether models have true expert-level reasoning rather than memorized facts.
Why do some models have missing benchmark scores?
Not all providers publish results for every benchmark. Some efficient or specialized models may skip benchmarks that are not relevant to their use case. We show a dash (-) when data is not publicly available.
What is the Chatbot Arena Elo score?
The LMSYS Chatbot Arena Elo score is based on millions of real human head-to-head comparisons where users vote for the better response without knowing which model produced it. It reflects real-world user preference rather than a predefined test, making it a strong indicator of practical quality.
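For intuition, here is a minimal sketch of the classic per-game Elo update applied to a single vote. The Arena's actual rating methodology is more involved (recent leaderboards fit a statistical model over all votes at once), and the K-factor below is illustrative.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Classic Elo update after one head-to-head vote between models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# A 1300-rated model beats a 1350-rated model: A gains ~18 points, B loses ~18.
print(elo_update(1300.0, 1350.0, a_wins=True))
```

The key property is the same either way: beating a higher-rated model moves your rating up more than beating a lower-rated one, so the scores converge toward each model's true win rate against the field.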
Is a higher Composite Score always better?
Not necessarily. The composite is an average across benchmarks. A model that excels at coding but has fewer math results might score lower than a well-rounded model. Always look at the specific benchmarks that matter for your use case.
What is the difference between frontier and reasoning models?
Frontier models are the most capable general-purpose models from major AI labs. Reasoning models (like o1, DeepSeek R1, Claude 3.7 Sonnet) use extended chain-of-thought processing to spend more time 'thinking' before answering, giving them an edge on complex math and science tasks at the cost of higher latency.