Claude Opus 4.7 Is Out. Here's What Actually Changed for Building a Claude Agent.
- Martin Chen
Anthropic shipped Claude Opus 4.7 on April 16, with a 1 million-token context window, a new xhigh effort tier, a dedicated Claude Code review command, and a claimed 14% improvement on long-running agent workflows while using fewer tokens.
The list-price number didn't move. The real bill for running a Claude agent did, because a new tokenizer quietly changed how many tokens the same prompt costs. Anthropic also conceded, more explicitly than it usually does, that 4.7 is not the frontier model it has internally. That one is still called Mythos, and it hasn't shipped.
If you're building production agents, the SWE-bench jump is the least interesting thing in this release. What matters are three API-surface changes that redefine what agent development actually looks like on top of Anthropic's stack, and one undisclosed cost shift that most of the launch coverage skipped.
What Shipped on April 16
Opus 4.7 is not a new generation. It's the first Anthropic release where the agent surface area matters more than the model weights.
The Claude Opus 4.7 announcement lists the headline numbers: API model ID claude-opus-4-7, 1M input / 128K output context, and pricing that stays at $5 per million input tokens and $25 per million output tokens, identical to Opus 4.6. GA rolled out the same day on AWS Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Cursor made it the default model on day one. GitHub Copilot CLI added it within hours.
The interesting additions are not in the model card. They are in the developer surface.
The first is xhigh, a new effort tier that sits between high and max. It is already the default in Claude Code for Pro, Max, Teams, and Enterprise subscriptions. A developer who upgraded without changing any flags is now running at a higher compute tier than they were a week ago.
The second is task budgets, shipping as a public beta. You hand the model a total token target for an agent loop, covering thinking, tool calls, and output, and the model is aware of the countdown while running. It prioritizes accordingly rather than exhausting the budget on exploration. Opt in via the task-budgets-2026-03-13 beta header.
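The opt-in shape, as a hedged sketch: the beta header value comes from the release notes, but the payload field name (`task_budget_tokens`) and overall request shape here are illustrative assumptions, not documented API.

```python
import json

def build_budgeted_request(prompt: str, budget_tokens: int) -> dict:
    """Assemble a request dict with an overall token budget for the agent loop.
    The task_budget_tokens field name is a hypothetical stand-in for whatever
    the beta actually calls the total budget across thinking, tools, and output."""
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 8192,
        "task_budget_tokens": budget_tokens,  # hypothetical field, for illustration
        "messages": [{"role": "user", "content": prompt}],
    }

# The beta header value is the one quoted in the release notes.
headers = {"anthropic-beta": "task-budgets-2026-03-13"}

request = build_budgeted_request("Summarize open incidents", budget_tokens=80_000)
print(json.dumps(request, indent=2))
```

The point of the sketch is the shape of the contract: one opt-in header, one ceiling the model can see mid-run.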
The third is /ultrareview, a slash command in Claude Code that triggers a dedicated multi-stage code review with fresh context on each pass. Pro and Max users get three free runs per period, after which it is metered.
Vision got a bump too. Supported image resolution now reaches 2,576 pixels on the long edge, roughly 3.75 megapixels, about three times the prior Claude ceiling. Long-running agent reliability, per Anthropic's own numbers, is up 14% on complex multi-step workflows while using fewer tokens, with roughly one-third the tool-call errors of Opus 4.6. Those numbers come from internal testing with launch partners, not an independent benchmark.
For agent builders, the practical read is that Anthropic shipped three new primitives in the developer surface and one capability delta in the model. The primitives are what change your code. The model delta is what changes your marketing.
The Three Things a Claude Agent Builder Actually Cares About
Task budgets, xhigh effort, and `/ultrareview` are the actual product. The model weights are table stakes.
Start with task budgets, because they quietly reshape how agents are priced. A typical agent loop used to have two cost surfaces: input tokens and output tokens. 4.7 adds a third, which is the length of the loop itself. With task budgets, you set the ceiling, and the model sees it while running, which means it allocates actions rather than spending its wallet on exploration. A research agent that used to burn 140K tokens to reach a conclusion can now be told to finish within an 80K budget and will self-truncate, choosing which tool calls to keep and which to skip.
This is the first time Anthropic has treated budget as an in-band signal rather than a billing artifact. The long-running agent economics deep dive frames it well: for long-running agents, task budgets plus the new tokenizer materially change the cost model, even without any headline pricing move. Builders who have been writing ad-hoc "stop after N turns" logic into their loops now have a first-class primitive for the same idea.
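The ad-hoc version builders have been writing looks roughly like this self-contained sketch (the per-call token costs are stand-ins, not real API accounting; task budgets move this bookkeeping into the model itself):

```python
def run_agent_loop(call_tool, budget_tokens: int, max_turns: int = 50) -> int:
    """Ad-hoc budget guard: stop when either the turn cap or the token
    budget would be exceeded. This is the client-side logic that a
    first-class task budget replaces."""
    spent = 0
    for turn in range(max_turns):
        cost = call_tool(turn)  # stand-in for one tool call's token cost
        if spent + cost > budget_tokens:
            break  # next call would exceed the budget: stop before overspending
        spent += cost
    return spent

# Usage: each fake tool call costs 10K tokens; an 80K budget allows 8 calls.
spent = run_agent_loop(lambda turn: 10_000, budget_tokens=80_000)
print(spent)  # 80000
```

The difference with the in-band signal is that the model, not the wrapper, decides which of those calls were worth keeping before the ceiling hits.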
xhigh effort is the quiet default change. Claude Code shipped with xhigh as the new baseline, which means every existing agent running on Claude Code is now operating at a higher compute tier. The corresponding quality gain comes with a proportional increase in tokens consumed. If you don't explicitly pass effort=high, your production runs are more expensive today than they were last week. For a team that runs Claude Code at any meaningful scale, that is a configuration change worth auditing, not ignoring.
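Auditing that default can be as simple as pinning the tier explicitly. A hedged sketch: the `effort` parameter name follows the effort=high phrasing above, and the exact API shape is an assumption, not documentation.

```python
def pin_effort(request: dict, tier: str = "high") -> dict:
    """Return a copy of the request with the effort tier set explicitly,
    rather than inheriting the new xhigh default. The tier names come from
    the release notes; the parameter shape is assumed for illustration."""
    allowed = {"high", "xhigh", "max"}
    if tier not in allowed:
        raise ValueError(f"unknown effort tier: {tier}")
    return {**request, "effort": tier}

request = pin_effort({"model": "claude-opus-4-7", "max_tokens": 4096})
print(request["effort"])  # high
```

Whatever the real flag turns out to be, the audit step is the same: make the tier an explicit, reviewed setting instead of an inherited default.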
/ultrareview is Anthropic's first dedicated Claude Code review surface. Each pass runs with fresh context, so feedback on one file doesn't get polluted by a long session's earlier tangents. This matters less because of the review quality, which will need independent testing, and more because it signals Anthropic is carving out specific agent modes as distinct metered products rather than treating them as applications of the general model.
The capability claim most agent builders actually want to verify comes from Scott Wu at Cognition, whose team has Opus 4.7 live as an Agent Preview in Devin. Wu, in a Devin Agent Preview post, said Anthropic has "clearly optimized Claude Opus 4.7 for long-horizon autonomy, unlocking a class of deep investigation work we couldn't reliably run before." Works coherently for hours, in other words. That is the concrete capability claim, and it is worth more than the SWE-bench number as a signal of what 4.7 can and can't do in a real agent.
The tangible adoption data point is Cursor. Its internal benchmark, CursorBench, moved from 58% with Opus 4.6 to 70% with 4.7, a 12-point jump, which the Cursor team cited when making the model the day-one default. If your development setup uses Cursor, you've already been running the upgrade.
Replit, also cited in Anthropic's launch post, said the bland-but-expensive thing: same quality at lower cost for log analysis, bug hunting, and fix proposals. That framing is what unlocks line-item migrations inside enterprise customers, where the question is not "is this model smarter" but "can I justify the spend to the VP who signs off on the billing." Notion reported the same +14% and one-third-tool-error-reduction numbers in its own agent workflows, which matters because Notion is one of the few launch partners whose agent use case is not obviously coding-shaped. If the reliability gains hold up outside of coding, that is the signal worth tracking over the next month.
The Stealth Price Increase No One's Talking About
Opus 4.7's list price is the same as 4.6. The per-task price isn't. A new tokenizer does the work the price sheet didn't.
Anthropic did not raise the $5 input / $25 output numbers. They didn't have to. The tokenizer in 4.7 is different, and it tokenizes the same strings at roughly 1.0 to 1.35 times the prior rate. Typical coding prompts, with their dense punctuation and repeated identifiers, land on the higher end of that range.
The math is straightforward. A prompt that cost you 10,000 tokens on Opus 4.6 can cost 12,000 to 13,500 tokens on 4.7. Same words, same price per token, higher bill. Combined with xhigh as the new Claude Code default, the typical developer sees a measurable cost increase on day one even without changing any code.
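The same arithmetic as a sketch, using the 1.35 multiplier from the top of the range quoted above:

```python
def inflated_cost_usd(old_tokens: int, multiplier: float,
                      usd_per_million: float = 5.0) -> float:
    """Cost of the same prompt after tokenizer inflation, at the
    unchanged $5-per-million-input-tokens list price."""
    return old_tokens * multiplier * usd_per_million / 1_000_000

before = inflated_cost_usd(10_000, 1.0)   # old tokenizer: $0.05
after = inflated_cost_usd(10_000, 1.35)   # new tokenizer, worst case: $0.0675
print(f"{(after / before - 1):.0%} per-task increase")  # 35%
```

Scale that by requests per day and the "no price change" framing starts looking like an accounting choice rather than a fact about your bill.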
Community reaction has been sharper than the usual post-launch noise. A community writeup calling 4.7 the worst release Anthropic has shipped lists three complaints: tokenizer bloat, the default-effort cost shift, and perceived regressions in creative writing. A top Reddit thread, "Claude Opus 4.7 is a serious regression, not an upgrade," has pulled roughly 2,300 upvotes and catalogues hallucinations (the infamous "strawberry" spelling has returned), résumé rewrites with fabricated schools, and more confirmation loops that burn tokens without adding signal.
Not all of this is fair. A meaningful share of the complaints overlap with the reaction Anthropic gets on every release, where users calibrated to the prior model experience the new defaults as regressions until they adjust their prompts. The specific claim about tokenizer bloat, however, is independently reproducible, and this is the rare case where skeptical developer writeups and Anthropic's own documentation technically agree. The tokenizer is new. New tokenizers tokenize differently. Bills reflect that.
The honest version of the release note would be: same price per token, more tokens per task, so your monthly bill goes up even if your agent code doesn't change. The stated angle is "no price change." Both are true. The meaningful one for a builder is the second. If you ship an agent that runs at scale, you want a week of production data before you decide whether the quality gain pays for the token inflation, and you want that data before the April 30 promotional 7.5× multiplier in Claude Code expires, because the real cost picture lives on the far side of that date.
Where 4.7 Stands vs GPT-5.4 and Gemini 3.1 Pro
Opus 4.7 narrowly retakes the coding and tool-use lead. On terminals and browsing, it still trails.
The benchmark picture is mixed, which is the accurate reading. On SWE-bench Verified, Opus 4.7 scores 87.6%, up from 80.8% on 4.6, ahead of GPT-5.3-Codex at 85.0% and Gemini 3.1 Pro at roughly 80.6%. On MCP-Atlas, a tool-use benchmark, 4.7 leads at 77.3%, ahead of Gemini 3.1 Pro at 73.9% and GPT-5.4 at 68.1%. That is the strongest competitive win for the Claude Opus 4 family in this release.
Terminal-Bench 2.0 tells a different story. Opus 4.7 scores 69.4%, which trails GPT-5.4's 75.1%. BrowseComp is worse: 4.7 comes in at 79.3%, a 4.4-point regression from 4.6's 83.7%, and well behind Gemini 3.1 Pro at 85.9% and GPT-5.4 Pro at 89.3%. On GPQA Diamond, the reasoning benchmark, all three are effectively tied at around 94%.
The "narrowly retaking the lead" framing captures the shape. This is not a generational jump. It is a release that wins some benchmarks, loses others, and ships better developer primitives than the competition.
The practical translation for agent builders: if your setup is coding-centric and tool-heavy, 4.7 is the default pick. If it is terminal-ops heavy or browse-heavy, you probably want to keep GPT-5.4-Codex or Gemini 3.1 Pro in rotation, either as the primary or as a second opinion.
The more revealing context is the Mythos concession, where Anthropic publicly admits that Opus 4.7 is not its frontier model. The company's internal Mythos model outperforms 4.7 on Anthropic's own evaluations and has not shipped. It is unusual for a major lab to explicitly tell the market that the thing it is releasing today is not the best thing it has. Read the release with that framing in mind: Opus 4.7 is a waypoint, not a peak.
What This Means for the Agent Tooling Layer
The 4.7 release validates what Cursor, Devin, and Replit bet on a year ago: the model is the commodity; the agent shell is the product.
Day-one default adoption in Cursor, GitHub Copilot CLI, and Devin's Agent Preview is the signal. IDE and agent vendors no longer wait for developer demand before cutting over to a new model. They switch immediately, because the differentiation they sell is no longer "which model do we wrap." It is "which shell makes the model usable for real work."
The launch-partner quotes covered earlier, Replit's cost framing and Notion's reliability numbers, fit the same pattern: the endorsements that move enterprise adoption are about shell economics, not benchmark wins.
The task budgets and /ultrareview combination is Anthropic's response to the reality that end users don't build with the raw API anymore. They build with Claude Code, with Cursor, with a partner's agent shell. By moving features into the shell, not just the model, Anthropic is competing where the customer actually lives.
Competitive implication: OpenAI's Codex and Google's Gemini agents are built on the same assumption. The race has shifted from "who has the smartest base model" to "who owns the surface where agents are built." For anyone building on Anthropic's stack, the decision tree is no longer "which model to call." It is "which shell to build on, and does that shell expose task budgets, effort controls, and code review as first-class primitives." Tools like a well-structured AI-native second brain start looking less like a research curiosity and more like the context layer an agent shell needs to ground its runs.
What's Next
Expect Mythos before summer, and expect Anthropic to keep pushing the agent surface ahead of model weights.
Short term, the 7.5× promotional request multiplier in Claude Code runs through April 30. Real cost data, on the new tokenizer at the new default effort, arrives in early May. Expect the community to re-benchmark the tokenizer bloat claim with clean after-promotion data, and expect Anthropic to respond with either a documentation update or a quiet adjustment to the Claude Code defaults.
Medium term, three to six months, Mythos is the elephant in the room. Anthropic has explicitly admitted a bigger model exists internally. Release cadence across Opus 4.6 → 4.7 suggests a summer window for a Mythos-class release. The more interesting signal is that Anthropic has telegraphed a product direction: the next model will be measured as much by what agent surface ships with it as by its benchmark numbers.
Longer term, task budgets are likely to graduate from beta to GA and become the default way agent loops are priced, billed, and surfaced in developer tooling. /ultrareview is the template for dedicated agent modes; expect the same pattern to show up for data analysis, research, and ops as distinct metered products rather than applications of the same model.
The open question for developers is whether Anthropic's strategy, of shipping new agent surfaces alongside modest model gains, holds up as GPT-5.4 and Gemini 3.1 Pro add comparable primitives. Model-weight leads are harder to defend than product surfaces. Product surfaces are harder to defend than ecosystems. Whoever ends up owning the Claude agent layer, or the competing Codex and Gemini layers, will have done so by building the shell, not by winning the benchmark.
Is 4.7 Worth Upgrading to Today?
If you run a coding-heavy Claude agent on Claude Code, the xhigh default and /ultrareview are worth the upgrade, even factoring in the stealth tokenizer inflation. If you run a browse-heavy one, you probably want GPT-5.4-Codex or Gemini 3.1 Pro still in the mix. Either way, the thing to budget for is not the upgrade itself but the ongoing audit: what does your agent cost per task now, and what did it cost a week ago? Once you've got that answer, how you recall your work memory across agent sessions starts to matter as much as the model underneath it. Opus 4.7 is a waypoint, not a peak. Build for the waypoint, but don't confuse it with the destination.