AI models comparison

The foundation models powering the platforms.

Every AI tool on the Yardstick platform list runs on one of these foundation models - sometimes named, sometimes not. This page compares them directly on the dimensions that matter to a buyer choosing a vendor: benchmark performance, input and output cost per million tokens, context window, multimodal support, and whether the model can run on your own infrastructure (open weights) or only via the provider's cloud. Numbers are vendor-published as of 2026-05-04 with the source linked. Where benchmarks differ across versions of the same model family, we report the production-default tier.

Frontier proprietary · cloud-only

| Model | Provider | Context | Input $/M | Output $/M | MMLU | GPQA | HumanEval | Modalities | Deployment |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 400K | $1.25 | $10.00 | 91.4% | 85.7% | 94.6% | text + image + audio | Cloud only |
| GPT-4o | OpenAI | 128K | $2.50 | $10.00 | 88.7% | 53.6% | 90.2% | text + image + audio | Cloud only |
| GPT-4o mini | OpenAI | 128K | $0.15 | $0.60 | 82.0% | 40.2% | 87.2% | text + image | Cloud only |
| o3 | OpenAI | 200K | $2.00 | $8.00 | 92.7% | 87.7% | 96.7% | text + image (reasoning) | Cloud only |
| Claude Sonnet 4.5 | Anthropic | 200K | $3.00 | $15.00 | 89.9% | 83.4% | 93.7% | text + image | Cloud only |
| Claude Opus 4 | Anthropic | 200K | $15.00 | $75.00 | 90.7% | 84.3% | 95.4% | text + image | Cloud only |
| Claude Haiku 3.5 | Anthropic | 200K | $0.80 | $4.00 | 76.0% | 41.6% | 88.1% | text + image | Cloud only |
| Gemini 2.5 Pro | Google | 2M | $1.25 | $10.00 | 86.4% | 84.0% | 93.4% | text + image + audio + video | Cloud only |
| Gemini 2.5 Flash | Google | 1M | $0.30 | $2.50 | 81.7% | 66.7% | 88.7% | text + image + audio + video | Cloud only |
| Grok 3 | xAI | 131K | $3.00 | $15.00 | 87.5% | 84.6% | 86.5% | text + image | Cloud only |

Open weights · cloud or local deployment

| Model | Provider | Context | Input $/M | Output $/M | MMLU | GPQA | HumanEval | Modalities | Deployment |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.3 70B | Meta | 128K | $0.59 | $0.79 | 86.0% | 50.5% | 88.4% | text | Cloud + Local |
| Llama 3.1 405B | Meta | 128K | $3.50 | $3.50 | 88.6% | 51.1% | 89.0% | text | Cloud + Local |
| Mistral Large 2 | Mistral | 128K | $2.00 | $6.00 | 84.0% | 48.0% | 91.5% | text | Cloud + Local |
| Codestral 25.01 | Mistral | 256K | $0.30 | $0.90 | 71.4% | 35.0% | 86.6% | text (code-tuned) | Cloud + Local |
| DeepSeek V3 | DeepSeek | 128K | $0.27 | $1.10 | 87.1% | 59.1% | 82.6% | text | Cloud + Local |
| DeepSeek R1 | DeepSeek | 128K | $0.55 | $2.19 | 90.8% | 71.5% | 96.3% | text (reasoning) | Cloud + Local |
| Qwen 2.5 72B | Alibaba | 128K | $0.40 | $1.20 | 86.1% | 49.0% | 88.4% | text | Cloud + Local |

Small / on-device · local-first

| Model | Provider | Context | Input $/M | Output $/M | MMLU | GPQA | HumanEval | Modalities | Deployment |
|---|---|---|---|---|---|---|---|---|---|
| Phi-4 (14B) | Microsoft | 16K | $0.07 | $0.14 | 84.8% | 56.1% | 82.6% | text | Local-first |
| Gemma 2 27B | Google | 8K | $0.20 | $0.20 | 75.2% | 28.8% | 71.4% | text | Local-first |

Notes:

  • Cloud only = closed-weights model, accessed via the provider's API.
  • Cloud + Local = open weights downloadable for self-hosting; same model also available on the provider's hosted API.
  • Local-first = open weights designed for on-device or air-gapped deployment.
  • Cost is the published list rate per million tokens for input and output respectively; volume discounts and committed-spend tiers are not reflected.
  • MMLU, GPQA, and HumanEval are vendor-published benchmark scores as of 2026-05-04. Treat them as directional, not gospel; benchmark contamination is real and methodology varies between providers.
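To make the per-million-token rates concrete, here is a minimal sketch of how a list-price estimate could be computed for a given token volume. The prices are copied from the table; the `Price` class, the function names, and the 50M-input / 10M-output monthly workload are illustrative assumptions, not Yardstick tooling, and the result ignores discounts, caching, and self-hosting compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens (list rate)
    output_per_m: float  # USD per million output tokens (list rate)


# List prices copied from the comparison table (a subset, for brevity).
PRICES = {
    "GPT-5": Price(1.25, 10.00),
    "Claude Sonnet 4.5": Price(3.00, 15.00),
    "Gemini 2.5 Flash": Price(0.30, 2.50),
    "DeepSeek V3": Price(0.27, 1.10),
}


def workload_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD for the given token volumes."""
    return (input_tokens / 1_000_000) * price.input_per_m + (
        output_tokens / 1_000_000
    ) * price.output_per_m


def output_input_ratio(price: Price) -> float:
    """How many times more an output token costs than an input token."""
    return price.output_per_m / price.input_per_m


if __name__ == "__main__":
    # Hypothetical monthly workload: 50M input tokens, 10M output tokens.
    for name, price in PRICES.items():
        cost = workload_cost(price, 50_000_000, 10_000_000)
        ratio = output_input_ratio(price)
        print(f"{name}: ${cost:,.2f} per month, output/input ratio {ratio:.1f}x")
```

Swapping in your own token volumes shows quickly whether input or output dominates the bill for a particular workload.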

How buyers should read this

A few patterns the table makes obvious once it's side-by-side.

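  • Frontier models charge 4–8× more per output token than per input token (GPT-4o and o3 at 4×, GPT-5 and Gemini 2.5 Pro at 8×). On chat/completion workloads where responses are long relative to prompts, output dominates total cost, so budget on output token volume first.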
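  • Most open-weight models are priced well below frontier proprietary models on hosted APIs, and carry no per-token fee when self-hosted (you pay for your own compute instead). DeepSeek V3 at $0.27 / $1.10 is roughly 5–9× cheaper than GPT-5 at $1.25 / $10.00; Llama 3.1 405B at $3.50 input is the exception. DeepSeek R1 in particular comes within two points of o3 on MMLU and HumanEval at roughly a quarter of the price, though it trails by 16 points on GPQA.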
  • Context window varies by about 16× among the frontier models (128K to 2M tokens) and by 250× across the whole table (8K to 2M). Gemini 2.5 Pro's 2M-token window is its main differentiator. Most workloads don't need it; some (long-document analysis, code-base RAG) absolutely do.
  • Multimodal is a flag, not a spectrum. Either the model accepts and reasons over images / audio / video natively, or it doesn't. If your tool needs vision, the nine text-only models in this table (every open-weight and local-first entry) drop out.
  • Local-first models (Phi-4, Gemma 2) trade benchmark performance and context length for the ability to run on a single workstation or air-gapped server. They are often the right choice for healthcare / finance / public-sector buyers with strict data-residency requirements.
  • Benchmark numbers are directional. MMLU saturates above ~85%; GPQA is the harder discriminator at the frontier. Pick on the benchmark that matches your workload, not the highest single number.
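One way to apply these patterns is to treat context window, modalities, and deployment as hard filters, then rank the survivors on the benchmark closest to your workload. The sketch below does this for a handful of rows copied from the table. The `Model` class, the `shortlist` function, and the choice of GPQA as the ranking key are illustrative assumptions, and context sizes such as "128K" are assumed to mean 128,000 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int   # assumption: "128K" in the table means 128_000 tokens
    input_per_m: float    # USD per million input tokens
    output_per_m: float   # USD per million output tokens
    gpqa: float           # vendor-published GPQA score, percent
    modalities: frozenset
    deployment: str       # "Cloud only", "Cloud + Local", or "Local-first"


# A subset of rows copied from the comparison table.
MODELS = [
    Model("GPT-5", 400_000, 1.25, 10.00, 85.7,
          frozenset({"text", "image", "audio"}), "Cloud only"),
    Model("Gemini 2.5 Pro", 2_000_000, 1.25, 10.00, 84.0,
          frozenset({"text", "image", "audio", "video"}), "Cloud only"),
    Model("Claude Sonnet 4.5", 200_000, 3.00, 15.00, 83.4,
          frozenset({"text", "image"}), "Cloud only"),
    Model("DeepSeek R1", 128_000, 0.55, 2.19, 71.5,
          frozenset({"text"}), "Cloud + Local"),
    Model("Llama 3.3 70B", 128_000, 0.59, 0.79, 50.5,
          frozenset({"text"}), "Cloud + Local"),
    Model("Phi-4 (14B)", 16_000, 0.07, 0.14, 56.1,
          frozenset({"text"}), "Local-first"),
]


def shortlist(models, *, min_context=0, needs=frozenset({"text"}),
              self_hostable=False):
    """Drop models that fail any hard requirement, then rank by GPQA (descending)."""
    keep = []
    for m in models:
        if m.context_tokens < min_context:
            continue  # context window too small
        if not needs <= m.modalities:
            continue  # a required modality is missing
        if self_hostable and m.deployment == "Cloud only":
            continue  # weights are not available for self-hosting
        keep.append(m)
    return sorted(keep, key=lambda m: m.gpqa, reverse=True)


if __name__ == "__main__":
    # Hypothetical requirement: vision plus at least a 300K-token context.
    for m in shortlist(MODELS, min_context=300_000,
                       needs=frozenset({"text", "image"})):
        print(m.name, m.gpqa)
```

Ranking on GPQA is only one option; a coding-heavy workload would more sensibly sort on HumanEval, and a cost-sensitive one on output price.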

The Yardstick platform list is what most buyers actually evaluate - pre-built tools that wrap one or more of these models with a workflow. This page is for the buyer who wants to know what's under the hood, or who is considering BYO-LLM via Bedrock / Azure OpenAI / Vertex AI / a self-hosted endpoint.

Methodology

Where these numbers came from.

Benchmark scores, context windows, and pricing are pulled from each provider's public model card or pricing page as of 2026-05-04. We do not run the benchmarks ourselves. Benchmark methodology is vendor-controlled and contamination is real, so treat the numbers as vendor claims rather than independent measurements. When two versions of the same model family report different scores, we use the production-default tier's numbers. The Yardstick methodology applies the same evidence-labelling discipline (CITED with URL, never fabricated) to model data as it does to platform tear-sheets.
