Benchmarks

AI model cost & capability benchmarks

Compare leading models on capability and cost efficiency using generated benchmark summaries when available. Static fallback rows are labelled clearly and are not presented as live measurements.

Latest model benchmark snapshot

Generated benchmark summaries are loaded from the public benchmark API when available.

Source: static fallback

Fallback rows are compiled into the site for resilience and are not live benchmark data.

Provider	Model	Capability	Cost efficiency	Measured at
OpenAI	gpt-4.1-mini	—	—	Static fallback
Anthropic	claude-3.5-haiku	—	—	Static fallback
Google	gemini-2.0-flash	—	—	Static fallback
Groq	llama-3.3-70b	—	—	Static fallback

Methodology

Capability score (0–100)

Composite of available public quality suites such as MMLU, HumanEval, HellaSwag, MBPP, and ARC, computed only from the suites for which generated scores exist. When no capability data is available, the score is shown as "not enough data".
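As a rough illustration of how such a composite could be formed, the sketch below averages whichever suite scores are present and returns nothing when none exist. The unweighted mean, the function names `capability_score` and `display_capability`, and the assumption that each suite score is already on a 0–100 scale are all illustrative; the page does not state the actual weighting.

```python
from statistics import mean
from typing import Mapping, Optional

# Suites named in the methodology; treating them as the full set is an assumption.
SUITES = ("MMLU", "HumanEval", "HellaSwag", "MBPP", "ARC")


def capability_score(suite_scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Average the available suite scores (assumed 0-100); None if no data exists."""
    available = [
        suite_scores[name]
        for name in SUITES
        if suite_scores.get(name) is not None
    ]
    if not available:
        return None  # no generated scores for this model
    # Unweighted mean is an assumption, not the documented formula.
    return round(mean(available), 1)


def display_capability(score: Optional[float]) -> str:
    """Render the score the way the methodology describes missing data."""
    return "not enough data" if score is None else f"{score:.1f}"
```

With this sketch, a model that has only MMLU and HumanEval scores is averaged over those two suites, and a model with no scores renders as "not enough data".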

Cost efficiency score (0–100)

The latest generated cost-efficiency suite score, normalised to 0–100. A higher score indicates stronger cost efficiency within the benchmark snapshot.
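The page does not say how the normalisation is done. One common approach is min-max scaling across the models in a snapshot, sketched below. The function name `normalise_0_100`, the raw input values, and the handling of ties are assumptions made for illustration only.

```python
from typing import Dict


def normalise_0_100(raw: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale raw cost-efficiency values to 0-100 within one snapshot.

    Assumes higher raw values already mean better cost efficiency.
    """
    if not raw:
        return {}
    lowest, highest = min(raw.values()), max(raw.values())
    if highest == lowest:
        # All models tie; giving every model 100.0 is an arbitrary assumption.
        return {model: 100.0 for model in raw}
    span = highest - lowest
    return {model: 100.0 * (value - lowest) / span for model, value in raw.items()}
```

Under min-max scaling the scores are relative to the snapshot: the best model in the set always receives 100 and the weakest 0, so scores from different snapshots are not directly comparable.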

Refresh cadence

A weekly scheduled job can publish generated scores to the benchmark database. If the API or generated data is unavailable, this page shows a labelled static fallback only.
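The fallback behaviour described here can be sketched as follows. This is a minimal illustration, not the site's implementation: the `load_snapshot` function, the injected `fetch` callable, and the JSON payload shape are hypothetical. Only the four fallback rows come from the table on this page.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

# Fallback rows as listed on this page; they carry no scores.
STATIC_FALLBACK: List[Dict[str, str]] = [
    {"provider": "OpenAI", "model": "gpt-4.1-mini"},
    {"provider": "Anthropic", "model": "claude-3.5-haiku"},
    {"provider": "Google", "model": "gemini-2.0-flash"},
    {"provider": "Groq", "model": "llama-3.3-70b"},
]

Row = Dict[str, Optional[str]]


def load_snapshot(fetch: Callable[[], str]) -> Tuple[str, List[Row]]:
    """Return (source label, rows); use the static rows if fetching fails.

    `fetch` is a hypothetical callable returning the API response body as a
    JSON array of row objects.
    """
    try:
        rows = json.loads(fetch())
        if not isinstance(rows, list) or not rows:
            raise ValueError("empty or malformed payload")
        return "generated", rows
    except (OSError, ValueError):
        # Network failures raise OSError; bad JSON raises a ValueError subclass.
        fallback: List[Row] = [
            {**row, "capability": None, "cost_efficiency": None,
             "measured_at": "Static fallback"}
            for row in STATIC_FALLBACK
        ]
        return "static fallback", fallback
```

Returning the source label alongside the rows lets the page display "Source: static fallback" whenever the generated data could not be loaded, so fallback rows are never mistaken for live measurements.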

Disclaimer

Benchmark data is provided for informational purposes. Actual performance varies by use case. Run evals on your own dataset using the ModelSpend evaluation framework.

Route to the best model automatically.

ModelSpend uses live pricing and benchmark data to route each prompt optimally. Setup in 4 minutes.
