AI models comparison
The foundation models powering the platforms.
Every AI tool on the Yardstick platform list runs on one of these foundation models - sometimes named, sometimes not. This page compares them directly on the dimensions that matter to a buyer choosing a vendor: benchmark performance, input and output cost per million tokens, context window, multimodal support, and whether the model can run on your own infrastructure (open weights) or only via the provider's cloud. Numbers are vendor-published as of 2026-05-04 with the source linked. Where benchmarks differ across versions of the same model family, we score the production-default tier.
| Model (vendor) | Context (tokens) | Input ($/M tokens) | Output ($/M tokens) | MMLU | GPQA | HumanEval | Modalities | Deployment |
|---|---|---|---|---|---|---|---|---|
| Frontier proprietary · cloud-only | | | | | | | | |
| GPT-5 (OpenAI) | 400K | $1.25 | $10.00 | 91.4% | 85.7% | 94.6% | text + image + audio | Cloud only |
| GPT-4o (OpenAI) | 128K | $2.50 | $10.00 | 88.7% | 53.6% | 90.2% | text + image + audio | Cloud only |
| GPT-4o mini (OpenAI) | 128K | $0.15 | $0.60 | 82.0% | 40.2% | 87.2% | text + image | Cloud only |
| o3 (OpenAI) | 200K | $2.00 | $8.00 | 92.7% | 87.7% | 96.7% | text + image (reasoning) | Cloud only |
| Claude Sonnet 4.5 (Anthropic) | 200K | $3.00 | $15.00 | 89.9% | 83.4% | 93.7% | text + image | Cloud only |
| Claude Opus 4 (Anthropic) | 200K | $15.00 | $75.00 | 90.7% | 84.3% | 95.4% | text + image | Cloud only |
| Claude Haiku 3.5 (Anthropic) | 200K | $0.80 | $4.00 | 76.0% | 41.6% | 88.1% | text + image | Cloud only |
| Gemini 2.5 Pro (Google) | 2M | $1.25 | $10.00 | 86.4% | 84.0% | 93.4% | text + image + audio + video | Cloud only |
| Gemini 2.5 Flash (Google) | 1M | $0.30 | $2.50 | 81.7% | 66.7% | 88.7% | text + image + audio + video | Cloud only |
| Grok 3 (xAI) | 131K | $3.00 | $15.00 | 87.5% | 84.6% | 86.5% | text + image | Cloud only |
| Open weights · cloud or local deployment | | | | | | | | |
| Llama 3.3 70B (Meta) | 128K | $0.59 | $0.79 | 86.0% | 50.5% | 88.4% | text | Cloud + Local |
| Llama 3.1 405B (Meta) | 128K | $3.50 | $3.50 | 88.6% | 51.1% | 89.0% | text | Cloud + Local |
| Mistral Large 2 (Mistral) | 128K | $2.00 | $6.00 | 84.0% | 48.0% | 91.5% | text | Cloud + Local |
| Codestral 25.01 (Mistral) | 256K | $0.30 | $0.90 | 71.4% | 35.0% | 86.6% | text (code-tuned) | Cloud + Local |
| DeepSeek V3 (DeepSeek) | 128K | $0.27 | $1.10 | 87.1% | 59.1% | 82.6% | text | Cloud + Local |
| DeepSeek R1 (DeepSeek) | 128K | $0.55 | $2.19 | 90.8% | 71.5% | 96.3% | text (reasoning) | Cloud + Local |
| Qwen 2.5 72B (Alibaba) | 128K | $0.40 | $1.20 | 86.1% | 49.0% | 88.4% | text | Cloud + Local |
| Small / on-device · local-first | | | | | | | | |
| Phi-4 (14B) (Microsoft) | 16K | $0.07 | $0.14 | 84.8% | 56.1% | 82.6% | text | Local-first |
| Gemma 2 27B (Google) | 8K | $0.20 | $0.20 | 75.2% | 28.8% | 71.4% | text | Local-first |
Notes: Cloud only = closed-weights model, accessed via the provider's API. Cloud + Local = open weights downloadable for self-hosting; same model also available on the provider's hosted API. Local-first = open weights designed for on-device or air-gapped deployment. Cost is the published list rate per million tokens for input and output respectively; volume discounts and committed-spend tiers are not reflected. MMLU, GPQA, HumanEval are vendor-published benchmark scores as of 2026-05-04. Treat as directional, not gospel; benchmark contamination is real and methodology varies between providers.
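To make the list rates concrete, here is a minimal sketch that turns the input and output rates into an estimated monthly spend. The function name `monthly_cost` and the token volumes are illustrative assumptions, not part of the Yardstick methodology; the three rates are copied from the table above.

```python
from decimal import Decimal

# List rates in USD per million tokens (input, output), copied from the table.
RATES = {
    "GPT-5": (Decimal("1.25"), Decimal("10.00")),
    "Claude Sonnet 4.5": (Decimal("3.00"), Decimal("15.00")),
    "DeepSeek R1": (Decimal("0.55"), Decimal("2.19")),
}

MILLION = Decimal(1_000_000)


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate list-price spend in USD. Ignores volume discounts and committed-spend tiers."""
    input_rate, output_rate = RATES[model]
    total = Decimal(input_tokens) * input_rate + Decimal(output_tokens) * output_rate
    return total / MILLION


# Hypothetical workload: 50M input tokens and 20M output tokens per month.
for name in RATES:
    print(f"{name}: ${monthly_cost(name, 50_000_000, 20_000_000):.2f}")
```

In this hypothetical workload, output is under a third of the token volume but accounts for $200.00 of GPT-5's $262.50 total, which is why the output rate deserves the closer look.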
How buyers should read this
A few patterns the table makes obvious once it's side-by-side.
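- Frontier models cost roughly 4–8× more on output than on input (4× for GPT-4o and o3, 5× for the Claude models and Grok 3, 8× for GPT-5 and Gemini 2.5 Pro). Output dominates total cost on chat/completion workloads. Budget on output token volume, not input.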
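- Open-weight models with the right deployment infra are often 5–20× cheaper than frontier proprietary models on hosted APIs, and carry no per-token fee when self-hosted (modulo your own compute). DeepSeek R1 in particular comes close to o3 on MMLU (90.8% vs 92.7%) and HumanEval (96.3% vs 96.7%) at roughly a quarter of the list price, though it trails on GPQA (71.5% vs 87.7%).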
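- Context window varies by about 16× among the frontier models (128K to 2M tokens) and by 250× across the full cohort once the small models are included (8K to 2M). Gemini 2.5 Pro's 2M-token window is its singular differentiator. Most workloads don't need it; some (long-document analysis, code-base RAG) absolutely do.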
- Multimodal is a flag, not a spectrum. Either the model accepts and reasons over images / audio / video natively, or it doesn't. If your tool needs vision, half the cohort drops out.
- Local-first models (Phi-4, Gemma 2) trade benchmark performance and context length for the ability to run on a single workstation or air-gapped server. They are the right choice for healthcare / finance / public-sector buyers with strict data-residency requirements.
- Benchmark numbers are directional. MMLU saturates above ~85%; GPQA is the harder discriminator at the frontier. Pick on the benchmark that matches your workload, not the highest single number.
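The patterns above amount to a two-step selection: apply hard requirements (context, modalities, deployment) first, then compare what survives on price. The sketch below illustrates that with a handful of rows from the table. The `Model` class, the `shortlist` function, and its parameters are hypothetical names chosen for this example, and "K" is treated as 1,000 tokens for simplicity.

```python
from dataclasses import dataclass

TEXT = frozenset({"text"})


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int
    input_usd: float   # list rate per million input tokens
    output_usd: float  # list rate per million output tokens
    modalities: frozenset
    deployment: str


# A subset of the table above; values copied as published.
MODELS = [
    Model("GPT-5", 400_000, 1.25, 10.00, frozenset({"text", "image", "audio"}), "Cloud only"),
    Model("Gemini 2.5 Pro", 2_000_000, 1.25, 10.00,
          frozenset({"text", "image", "audio", "video"}), "Cloud only"),
    Model("Claude Sonnet 4.5", 200_000, 3.00, 15.00, frozenset({"text", "image"}), "Cloud only"),
    Model("DeepSeek R1", 128_000, 0.55, 2.19, TEXT, "Cloud + Local"),
    Model("Llama 3.3 70B", 128_000, 0.59, 0.79, TEXT, "Cloud + Local"),
    Model("Phi-4 (14B)", 16_000, 0.07, 0.14, TEXT, "Local-first"),
]


def output_input_ratio(model: Model) -> float:
    """How many times more an output token costs than an input token."""
    return model.output_usd / model.input_usd


def shortlist(models, min_context=0, needs=TEXT, self_hostable=False):
    """Drop models that miss a hard requirement, then sort by output price."""
    keep = [
        m for m in models
        if m.context_tokens >= min_context
        and needs <= m.modalities  # every required modality is supported
        and (not self_hostable or m.deployment != "Cloud only")
    ]
    return sorted(keep, key=lambda m: m.output_usd)


# A tool that needs vision: every text-only model drops out.
print([m.name for m in shortlist(MODELS, needs=frozenset({"text", "image"}))])

# A self-hosted deployment that needs at least 100K tokens of context.
print([m.name for m in shortlist(MODELS, min_context=100_000, self_hostable=True)])
```

Sorting by output price is one reasonable default given that output dominates chat/completion cost; a retrieval-heavy workload with long prompts and short answers might sort by input price instead.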
The Yardstick platform list is what most buyers actually evaluate - pre-built tools that wrap one or more of these models with a workflow. This page is for the buyer who wants to know what's under the hood, or who is considering BYO-LLM via Bedrock / Azure OpenAI / Vertex AI / a self-hosted endpoint.
Methodology
Where these numbers came from.
Benchmark scores, context windows, and pricing are pulled from each provider's public model card or pricing page as of 2026-05-04. We do not run the benchmarks ourselves - benchmark methodology is vendor-controlled and contamination is real, so treat the numbers as vendor claims rather than independent measurements. When two versions of the same model family report different scores, we use the production-default tier's numbers. The Yardstick methodology applies the same evidence-labelling discipline (CITED with URL, never fabricated) to model data as it does to platform tear-sheets.
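One way to picture the "CITED with URL" discipline is a record type that cannot be created without a source link and an as-of date. This is only a sketch of the idea: the `CitedValue` class and its field names are hypothetical, not Yardstick's actual data model, and the URL shown is a placeholder rather than a real citation.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class CitedValue:
    model: str
    metric: str
    value: float
    source_url: str  # where the vendor published the number
    as_of: date      # the date the number was read from that page

    def __post_init__(self):
        # Reject records whose source is missing or not an absolute http(s) URL.
        parsed = urlparse(self.source_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"{self.model} {self.metric}: a source URL is required")


# Placeholder URL for illustration only.
score = CitedValue("GPT-5", "MMLU", 91.4, "https://example.com/model-card", date(2026, 5, 4))
print(score.metric, score.value, score.source_url)
```

Validating at construction time means an uncited number fails loudly when the data is loaded, rather than surfacing later in a published table.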