On May 28, 2026, Anthropic released Claude Opus 4.8 alongside a $65 billion funding round and a promise that Mythos-class models were coming soon. The model arrived packaged with dynamic workflows, a cheaper fast mode, and improvements across coding, reasoning, and knowledge work. Most of those updates landed within what the AI industry has come to expect from an incremental flagship release. One did not. On the 2026 USA Mathematical Olympiad benchmark, Opus 4.8 scored 96.7 percent — up from 69.3 percent on Opus 4.7. The 27.4-point gain is the largest single-cycle math improvement in the Opus line's history, and it signals something about the state of mathematical reasoning in frontier AI that goes beyond any single number.
What It Takes to Prove Something
The USA Mathematical Olympiad is not a multiple-choice quiz. Problems require constructing rigorous multi-step proofs: identifying which approach to take, ruling out dead ends, and producing a logically watertight argument from premises to conclusion. Performance is hard to gain by pattern-matching to a training corpus, because the problems change each year and solutions require original reasoning. Proofs are graded with partial credit, so a score of 69 percent is better read as roughly seven-tenths of the available credit than as seven problems in ten solved outright. A score of 97 percent means the model is earning nearly full marks on the exam.
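To put those percentages in exam terms: the USAMO has six problems, each graded from 0 to 7, for 42 points in total. The sketch below assumes that the benchmark percentage is the share of those 42 points earned, which the published figures do not confirm. The function names are illustrative only.

```python
# Assumption: the benchmark percentage is the share of total USAMO points earned.
PROBLEMS = 6              # the USAMO has six problems
POINTS_PER_PROBLEM = 7    # each proof is graded from 0 to 7
MAX_POINTS = PROBLEMS * POINTS_PER_PROBLEM  # 42 points in total


def score_to_points(percent: float) -> float:
    """Convert a percentage score into the equivalent number of exam points."""
    return percent / 100 * MAX_POINTS


def residual_gap(percent: float) -> float:
    """Percentage of the available credit that was not earned."""
    return 100 - percent


opus_47 = 69.3  # reported USAMO 2026 score for Opus 4.7
opus_48 = 96.7  # reported USAMO 2026 score for Opus 4.8

print(round(score_to_points(opus_47), 1))   # about 29.1 of 42 points
print(round(score_to_points(opus_48), 1))   # about 40.6 of 42 points
print(round(opus_48 - opus_47, 1))          # 27.4 percentage points gained
# How much the unearned share shrank between the two releases (about 9.3x).
print(round(residual_gap(opus_47) / residual_gap(opus_48), 1))
```

Read this way, the unearned credit falls from about 31 percent to about 3 percent, which is a larger relative change than the 27.4-point headline suggests.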
The improvement matters because it suggests Opus 4.8's gains are qualitative, not just quantitative. Anthropic's adaptive thinking system — which triggers extended reasoning only when a task demands it — appears to have become more precise about when to engage deeper problem decomposition on mathematical tasks. The model also needs 15 percent fewer passes per task and 35 percent fewer output tokens on the GDPval-AA benchmark compared to Opus 4.7. Doing more with less, across reasoning tasks, is a cleaner proof of progress than a benchmark score alone.
Claude Opus 4.8: Key Benchmark Results
- USAMO 2026 (math proofs): 96.7% (up from 69.3% on Opus 4.7)
- Single-cycle math gain: +27.4 percentage points
- Artificial Analysis Intelligence Index: 61.4 (up 4.1 pts, leads GPT-5.5)
- SWE-Bench Pro (agentic coding): 69.2% vs GPT-5.5's 58.6%
- GDPval-AA Elo (knowledge work): 1,890 (~67% win rate vs GPT-5.5)
- API pricing: $5 / $25 per MTok input / output (unchanged from Opus 4.7)
The Full Picture
The USAMO result is the headline, but it does not represent Opus 4.8's standing across every dimension. On Terminal-Bench 2.1, which covers shell scripting, system administration, and command-line tool usage, Opus 4.8 scores 74.6 percent against GPT-5.5's 78.2 percent. OpenAI's model keeps a 3.6-point lead there, and developers working on infrastructure automation, DevOps pipelines, or anything involving complex Bash scripting should account for that gap.
Elsewhere, Opus 4.8 pulls ahead consistently. On SWE-Bench Pro, an agentic software engineering benchmark, the model hits 69.2 percent against GPT-5.5's 58.6 percent. On GDPval-AA, an Elo-style ranking across real-world knowledge work tasks, Opus 4.8 sits at 1,890, implying roughly a 67 percent head-to-head win rate against GPT-5.5. The Artificial Analysis Intelligence Index, a composite metric across multiple evaluations, places Opus 4.8 at 61.4, up 4.1 points from Opus 4.7 and 1.2 points ahead of GPT-5.5, which had held the top position. The result is a model that outperforms its primary competitor across most measured dimensions but not all, which is a more credible claim than a clean sweep and more actionable information for teams making deployment decisions.
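For readers who want to see how an Elo rating relates to a head-to-head win rate, here is a minimal sketch using the standard Elo expected-score formula. It assumes GDPval-AA follows that standard logistic model with a 400-point scale, which is not confirmed here. The rating gap it prints is derived from the 67 percent figure and is not a reported number.

```python
import math


def elo_expected_score(rating_gap: float) -> float:
    """Expected win rate for the higher-rated side, given its rating lead.

    Standard Elo formula: 1 / (1 + 10 ** (-gap / 400)).
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


def elo_gap_for_win_rate(win_rate: float) -> float:
    """Invert the formula: the rating lead that yields a given win rate."""
    return 400 * math.log10(win_rate / (1 - win_rate))


# Assumption: the ~67% win rate comes from the standard Elo model.
implied_gap = elo_gap_for_win_rate(0.67)
print(round(implied_gap))                            # roughly a 123-point lead
print(round(elo_expected_score(implied_gap), 2))     # recovers 0.67
```

Under that assumption, a 67 percent win rate corresponds to a lead of about 123 Elo points, which gives a sense of scale for the 1,890 rating.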
"Opus 4.8 is roughly four times less likely than its predecessor to allow flaws in code it has written to pass unremarked." Anthropic, Claude Opus 4.8 launch announcement, May 28, 2026
From Benchmark to Business Value
The practical question for any organization considering Opus 4.8 is whether math benchmark improvements translate to business value. In most cases they do, though the path is indirect. Models with stronger mathematical reasoning produce more reliable financial analysis, handle scientific computations that earlier models fumbled, and write cleaner code in domains where mathematical correctness matters — numerical methods, signal processing, cryptographic implementations. The gap between a model that scores 69 percent and one that scores 97 percent on rigorous proof construction can represent a meaningful difference in output quality for teams in finance, pharmaceuticals, or engineering.
Anthropic's own release documentation points in the same direction, highlighting agentic financial analysis as one of Opus 4.8's most improved capability areas alongside coding and long-horizon reasoning. The full Opus 4.8 launch also brought mid-conversation system messages, refusal category reporting in the API, and task budget support — developer features that extend the model's usefulness in production agentic applications. All of this ships at the same $5 per million input tokens and $25 per million output tokens as Opus 4.7.
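Unchanged per-token pricing does not mean unchanged per-task cost if the model emits fewer tokens. The sketch below works through that arithmetic using the listed $5 and $25 per million token rates. The token counts are hypothetical, and applying the 35 percent output-token reduction (measured on GDPval-AA) to an arbitrary workload is an assumption made for illustration only.

```python
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens (listed price)
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens (listed price)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-token rates."""
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


# Hypothetical workload: 20,000 input tokens and 4,000 output tokens per task.
input_tokens = 20_000
baseline_output = 4_000

# Assumption: the 35% output-token reduction carries over to this workload.
reduced_output = round(baseline_output * (1 - 0.35))

baseline = request_cost(input_tokens, baseline_output)
reduced = request_cost(input_tokens, reduced_output)

print(f"baseline: ${baseline:.3f}")                   # baseline: $0.200
print(f"reduced:  ${reduced:.3f}")                    # reduced:  $0.165
print(f"savings:  {(1 - reduced / baseline):.1%}")    # savings:  17.5%
```

In this example the saving is smaller than 35 percent because input tokens, which are unaffected, make up half the baseline cost. Output-heavy workloads would see a larger share of the reduction.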
The Pace of Change
Forty-one days separated Claude Opus 4.7 from Opus 4.8 — the shortest flagship-to-flagship interval in Anthropic's history. Within that window, the company produced a 27-point math improvement, a new top position on the Artificial Analysis composite index, and a model that surpassed the prior Claude benchmark record on most major evaluations. Whether Opus 4.8's gains in mathematical reasoning compound further in Mythos-class models is now one of the more interesting open questions in frontier AI development.
For developers and enterprise teams, the immediate takeaway is simpler. Opus 4.8 is the strongest generally available model Anthropic has shipped, across most measurable criteria, at no change in price. For workflows involving analysis, reasoning, or complex knowledge work, upgrading from Opus 4.7 carries no cost penalty and a clear capability gain.