AI models comparison

The foundation models powering the platforms.

Every AI tool on the Yardstick platform list runs on one of these foundation models - sometimes named, sometimes not. This page compares them directly on the dimensions that matter to a buyer choosing a vendor: benchmark performance, input and output cost per million tokens, context window, multimodal support, and whether the model can run on your own infrastructure (open weights) or only via the provider's cloud. Numbers are vendor-published as of 2026-05-04 with the source linked. Where benchmarks differ across versions of the same model family, we score the production-default tier.

Search 19 of 19

Model	Context	Input $/M	Output $/M	MMLU	GPQA	HumanEval	Modalities	Deployment
Frontier proprietary · cloud-only
GPT-5OpenAI	400K	$1.25	$10.00	91.4%	85.7%	94.6%	text + image + audio	Cloud only
GPT-4oOpenAI	128K	$2.50	$10.00	88.7%	53.6%	90.2%	text + image + audio	Cloud only
GPT-4o miniOpenAI	128K	$0.15	$0.60	82.0%	40.2%	87.2%	text + image	Cloud only
o3OpenAI	200K	$2.00	$8.00	92.7%	87.7%	96.7%	text + image (reasoning)	Cloud only
Claude Sonnet 4.5Anthropic	200K	$3.00	$15.00	89.9%	83.4%	93.7%	text + image	Cloud only
Claude Opus 4Anthropic	200K	$15.00	$75.00	90.7%	84.3%	95.4%	text + image	Cloud only
Claude Haiku 3.5Anthropic	200K	$0.80	$4.00	76.0%	41.6%	88.1%	text + image	Cloud only
Gemini 2.5 ProGoogle	2M	$1.25	$10.00	86.4%	84.0%	93.4%	text + image + audio + video	Cloud only
Gemini 2.5 FlashGoogle	1M	$0.30	$2.50	81.7%	66.7%	88.7%	text + image + audio + video	Cloud only
Grok 3xAI	131K	$3.00	$15.00	87.5%	84.6%	86.5%	text + image	Cloud only
Open weights · cloud or local deployment
Llama 3.3 70BMeta	128K	$0.59	$0.79	86.0%	50.5%	88.4%	text	Cloud + Local
Llama 3.1 405BMeta	128K	$3.50	$3.50	88.6%	51.1%	89.0%	text	Cloud + Local
Mistral Large 2Mistral	128K	$2.00	$6.00	84.0%	48.0%	91.5%	text	Cloud + Local
Codestral 25.01Mistral	256K	$0.30	$0.90	71.4%	35.0%	86.6%	text (code-tuned)	Cloud + Local
DeepSeek V3DeepSeek	128K	$0.27	$1.10	87.1%	59.1%	82.6%	text	Cloud + Local
DeepSeek R1DeepSeek	128K	$0.55	$2.19	90.8%	71.5%	96.3%	text (reasoning)	Cloud + Local
Qwen 2.5 72BAlibaba	128K	$0.40	$1.20	86.1%	49.0%	88.4%	text	Cloud + Local
Small / on-device · local-first
Phi-4 (14B)Microsoft	16K	$0.07	$0.14	84.8%	56.1%	82.6%	text	Local-first
Gemma 2 27BGoogle	8K	$0.20	$0.20	75.2%	28.8%	71.4%	text	Local-first

Notes: Cloud only = closed-weights model, accessed via the provider's API. Cloud + Local = open weights downloadable for self-hosting; same model also available on the provider's hosted API. Local-first = open weights designed for on-device or air-gapped deployment. Cost is the published list rate per million tokens for input and output respectively; volume discounts and committed-spend tiers are not reflected. MMLU, GPQA, HumanEval are vendor-published benchmark scores as of 2026-05-04. Treat as directional, not gospel; benchmark contamination is real and methodology varies between providers.

How buyers should read this

A few patterns the table makes obvious once it's side-by-side.

Frontier models cost ~10× more on output than input. Output dominates total cost on chat/completion workloads. Budget on output token volume, not input.
Open-weight models with the right deployment infra are 5-20× cheaper than frontier proprietary on hosted APIs and free for self-hosted (modulo your own compute). DeepSeek R1 in particular gets close to o3 on reasoning at a fraction of the price.
Context window varies by 16× across the cohort - Gemini 2.5 Pro's 2M-token window is its singular differentiator. Most workloads don't need it; some (long-document analysis, code-base RAG) absolutely do.
Multimodal is a flag, not a spectrum. Either the model accepts and reasons over images / audio / video natively, or it doesn't. If your tool needs vision, half the cohort drops out.
Local-first models (Phi-4, Gemma 2) trade benchmark performance for the ability to run on a single workstation or air-gapped server. The right choice for healthcare / finance / public-sector buyers with strict data-residency requirements.
Benchmark numbers are directional. MMLU saturates above ~85%; GPQA is the harder discriminator at the frontier. Pick on the benchmark that matches your workload, not the highest single number.

The Yardstick platform list is what most buyers actually evaluate - pre-built tools that wrap one or more of these models with a workflow. This page is for the buyer who wants to know what's under the hood, or who is considering BYO-LLM via Bedrock / Azure OpenAI / Vertex AI / a self-hosted endpoint.

Methodology

Where these numbers came from.

Benchmark scores, context windows, and pricing are pulled from each provider's public model card or pricing page as of 2026-05-04. We do not run the benchmarks ourselves - benchmark methodology is vendor-controlled and contamination is real, so treat the numbers as the floor of vendor honesty rather than independent measurements. When two versions of the same model family report different scores, we use the production-default tier's numbers. The Yardstick methodology applies the same evidence-labelling discipline (CITED with URL, never fabricated) to model data as it does to platform tear-sheets.

See the platforms that use these models →

Take the free 4-minute readiness audit.

Get your score, peer benchmarks, and three tailored vendor recommendations. No email required to see your results.

Take the audit