Anthropic released Claude Opus 4.7 on April 16, 2026, positioning it as their most capable generally available model with a specific emphasis on agentic reliability. The two headline features are built-in output verification, where the model devises ways to check its own work before reporting back, and a 3x improvement in vision resolution to 3.75 megapixels.
The benchmark numbers tell the expected story of incremental gains over Opus 4.6, but the more interesting development is what Anthropic chose to optimize for. Rather than chasing knowledge benchmarks where frontier models are already within noise of each other, they focused on the reliability dimension that determines whether a model can actually run autonomously over extended workflows without producing compounding errors.
What changed from Opus 4.6
Opus 4.7 is not a new architecture but a refinement of the Opus 4.6 foundation with three targeted improvements. Understanding what changed and why requires looking at where Opus 4.6 fell short in real-world agentic deployments.
Output verification
The most consequential addition is that Opus 4.7 devises ways to verify its own outputs before reporting back. In practice, this means the model checks generated code against the problem's requirements before committing to a solution and cross-checks factual claims against its context window. Anthropic does not brand this as a discrete feature; it describes it as an emergent behavior of the model's agentic training.
For agentic workflows where the model operates over dozens of steps without human oversight, this matters more than raw benchmark scores. The difference between a model that sometimes produces plausible-but-wrong output and one that catches its own errors before they propagate is the difference between a useful tool and a liability.
This behavior is distinct from extended thinking. Extended thinking allocates tokens for reasoning about how to approach a problem. Output verification allocates compute for checking whether the answer is correct after the reasoning is complete. Both contribute to output quality, but they address different failure modes: extended thinking reduces errors of reasoning, while output verification reduces errors of execution.
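The generate-then-verify pattern the model now performs internally looks roughly like the loop below. This is a minimal sketch of the concept, not the Anthropic API: `generate` and `verify` are hypothetical stand-ins for the model's proposal and self-check steps.

```python
# Sketch of a generate-then-verify loop. Both helpers are hypothetical
# stand-ins for illustration, not real API calls.

def generate(task: str, attempt: int) -> int:
    # Stand-in generator: the first attempt is wrong on purpose
    # to show the retry path.
    return 41 + attempt  # attempt 0 -> 41 (wrong), attempt 1 -> 42

def verify(task: str, answer: int) -> bool:
    # Stand-in checker: re-derive the expected result independently
    # rather than trusting the candidate answer.
    return answer == 6 * 7

def solve_with_verification(task: str, max_attempts: int = 3) -> int:
    """Retry until a candidate answer passes its own check."""
    for attempt in range(max_attempts):
        candidate = generate(task, attempt)
        if verify(task, candidate):
            return candidate
    raise RuntimeError("no verified answer within budget")

print(solve_with_verification("what is 6 * 7?"))  # 42, on the second attempt
```

The key design point is that verification is a separate, independent pass over the finished answer, which is exactly why it catches errors of execution rather than errors of reasoning.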
High-resolution vision
Previous Claude models capped image input at 1,568 pixels on the long edge, roughly 1.15 megapixels. Opus 4.7 raises this to 2,576 pixels on the long edge, approximately 3.75 megapixels, which represents more than three times the visual information available per image.
The practical impact is most visible in document understanding, chart reading, and UI analysis. At 1.15 megapixels, fine text in documents and small labels on charts were often below the resolution threshold for reliable extraction. At 3.75 megapixels, these details become readable without preprocessing or tiling workarounds.
For teams using Claude in document processing pipelines or for visual QA over screenshots and diagrams, this is a material improvement in accuracy without any changes to existing prompts or workflows.
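The resize math falls out of the stated limits. The sketch below assumes the cap works like earlier Claude versions' dual constraint (a long-edge limit and a total-pixel limit applied together); the exact resizing behavior is an assumption, not documented API behavior.

```python
import math

def scaled_dims(width: int, height: int,
                long_edge_cap: int = 2576,      # stated long-edge limit
                mp_cap: float = 3.75e6) -> tuple:  # stated ~3.75 MP limit
    """Downscale so both the long-edge and total-pixel caps hold.
    Assumes a dual constraint like earlier Claude versions."""
    scale = min(
        1.0,                                   # never upscale
        long_edge_cap / max(width, height),    # long-edge constraint
        math.sqrt(mp_cap / (width * height)),  # total-pixel constraint
    )
    return round(width * scale), round(height * scale)

# A 12 MP photo (4000x3000) is bound by the pixel cap:
print(scaled_dims(4000, 3000))  # → (2236, 1677), about 3.75 MP
# A wide 3 MP image is bound by the long-edge cap instead:
print(scaled_dims(3000, 1000))  # → (2576, 859)
```

Under the old 1,568-pixel cap, that 12 MP photo would have been reduced to roughly a third as many pixels, which is where the gains in fine-text legibility come from.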
Adaptive thinking and xhigh reasoning
Opus 4.7 introduces a new xhigh reasoning tier that sits between high and max, giving developers more granular control over reasoning depth. The recommended configuration is adaptive thinking (thinking: {type: "adaptive"}), which allows the model to dynamically adjust its reasoning depth based on problem complexity rather than requiring the developer to select a fixed tier.
This matters for cost management in production. A fixed max reasoning tier allocates maximum thinking tokens for every request, including simple ones that do not benefit from extended reasoning. Adaptive thinking lets the model use more thinking tokens for complex requests and fewer for straightforward ones, which reduces average cost per request while maintaining quality on the requests that actually need deep reasoning.
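The adaptive configuration described above can be expressed as a request body along these lines. This is a sketch assuming the Messages API request shape; only the thinking field is taken from the text, and the surrounding fields should be checked against the API reference rather than treated as verified SDK usage.

```python
# Sketch of a request body with adaptive thinking enabled.
# Field names other than "thinking" are assumed from the general
# Messages API shape, not confirmed for this release.
request = {
    "model": "claude-opus-4-7",           # model ID from the spec table
    "max_tokens": 4096,
    "thinking": {"type": "adaptive"},     # model picks its own reasoning depth
    "messages": [
        {"role": "user", "content": "Summarize the attached earnings report."}
    ],
}

# A fixed tier would instead pin the depth explicitly, e.g.:
fixed_tier = {"type": "enabled", "effort": "xhigh"}  # hypothetical shape
```

The practical difference is that the adaptive request spends few thinking tokens on the simple summarization above but would scale up on a gnarly debugging task, whereas a fixed `xhigh` or `max` tier pays the full thinking budget either way.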
Benchmark performance
The benchmark landscape at the frontier is increasingly compressed. On knowledge and reasoning benchmarks, the differences between Opus 4.7, GPT-5.4, and Gemini 3.1 Pro are within noise. Where the models diverge is on agentic and coding tasks, which reflects genuine differences in what each lab is optimizing for.
Agentic and coding
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 64.3% | 58.1% | 61.2% | 59.8% |
| OSWorld (computer use) | 78.0% | 72.4% | 68.5% | 70.1% |
| CharXiv (visual reasoning) | 82.1% | 76.3% | 79.4% | 80.2% |
SWE-bench Pro is the most meaningful metric here because it measures end-to-end software engineering capability: given a GitHub issue and a codebase, can the model produce a correct patch? Opus 4.7's 64.3% represents a 6.2 percentage point improvement over Opus 4.6. For a benchmark where the top models have been separated by single-digit margins for the past year, that is a substantial jump.
The OSWorld score of 78.0% measures computer use: the ability to interact with a desktop environment through screenshots, mouse clicks, and keyboard input. This is directly relevant to agentic applications where the model needs to navigate software interfaces autonomously.
Knowledge and reasoning
| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond | 94.2% | 94.4% | 94.3% |
| General Finance | 0.813 | 0.791 | 0.802 |
On GPQA Diamond (graduate-level science questions), all three models score within 0.2 percentage points of each other. This convergence at the frontier is worth noting: it suggests diminishing returns from scaling alone on knowledge benchmarks and explains why labs are differentiating on capability dimensions like tool use, computer use, and multi-step reliability rather than raw knowledge scores.
The General Finance benchmark tells a different story. Opus 4.7's 0.813 versus Opus 4.6's 0.767 is a meaningful improvement on a domain that requires both knowledge and structured reasoning over numerical data. That is precisely the kind of task where output verification should provide the most benefit, and the numbers bear it out.
Specifications
| Specification | Opus 4.7 |
|---|---|
| Model ID | claude-opus-4-7 |
| Context window | 1M tokens |
| Max output (sync API) | 128K tokens |
| Max output (batch API) | 300K tokens |
| Vision resolution | 3.75 MP (2,576px long edge) |
| Thinking modes | Adaptive, low, medium, high, xhigh, max |
| Input pricing | $5 / 1M tokens |
| Output pricing | $25 / 1M tokens |
| Prompt caching savings | Up to 90% |
| Batch processing savings | 50% |
Pricing is identical to Opus 4.6, which means teams can upgrade without budget impact. The full 1M token context window is billed at the same per-token rate regardless of whether a request uses 9,000 or 900,000 tokens. There is no surcharge for long contexts.
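The published rates make per-request cost easy to estimate. The sketch below uses the table's numbers ($5/$25 per million tokens, up to 90% caching savings, 50% batch discount) and deliberately simplifies: it models cached input at 10% of the base rate and ignores any cache-write surcharge.

```python
PRICE_IN = 5.00 / 1_000_000    # $ per input token (from the spec table)
PRICE_OUT = 25.00 / 1_000_000  # $ per output token

def request_cost(input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0, batch: bool = False) -> float:
    """Estimate request cost in dollars. Simplified: cached input is
    billed at 10% of base (the 'up to 90%' savings), and cache-write
    surcharges are ignored."""
    in_cost = input_tokens * PRICE_IN * (
        (1 - cached_fraction) + cached_fraction * 0.10)
    out_cost = output_tokens * PRICE_OUT
    total = in_cost + out_cost
    return total * 0.5 if batch else total  # batch processing: 50% off

# 200K-token prompt with 80% served from cache, 2K tokens of output:
print(round(request_cost(200_000, 2_000, cached_fraction=0.8), 4))  # → 0.33
```

Because there is no long-context surcharge, the same function applies unchanged whether the prompt is 9,000 tokens or 900,000.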
Availability
Opus 4.7 is available across all Claude products (claude.ai, Claude Code, the Claude API) and through cloud providers including Amazon Bedrock, Google Cloud Vertex AI, and Azure. The model ID for the API is claude-opus-4-7.
For teams already using Opus 4.6, migration is a model ID swap. The API interface, tool use patterns, and system prompt format are unchanged. Anthropic's migration guide covers the specifics, but the short version is that any application built for Opus 4.6 will work with Opus 4.7 without code changes beyond updating the model string.
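The "model ID swap" claim can be made concrete: if requests are built from a shared config, migration touches one string. The new model ID comes from the specifications table; the old one (`claude-opus-4-6`) is an assumption for illustration.

```python
OLD_MODEL = "claude-opus-4-6"  # assumed prior model ID, for illustration
NEW_MODEL = "claude-opus-4-7"  # from the specifications table

def build_request(model: str, prompt: str) -> dict:
    # Identical request construction for both models: the migration
    # claim is that nothing here changes except the model string.
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

before = build_request(OLD_MODEL, "hello")
after = build_request(NEW_MODEL, "hello")
diff = {k for k in before if before[k] != after[k]}
print(diff)  # → {'model'}
```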
The Mythos question
Anthropic acknowledged that Opus 4.7 does not match the performance of Mythos Preview, an unreleased model that has been made available only to a select group of technology and cybersecurity companies. This is an unusual level of transparency from an AI lab: explicitly conceding that a publicly available model is not their most capable.
The implication is that Anthropic is sitting on capabilities they consider too risky for general release, which aligns with their stated approach to AI safety but raises practical questions about what the capability gap looks like and when it might close.
Safety and the Cyber Verification Program
Opus 4.7 ships with automated safeguards that detect and block requests related to prohibited or high-risk cybersecurity use cases. More notably, Anthropic introduced a Cyber Verification Program that allows security professionals to test the model's capabilities in controlled environments for legitimate purposes like penetration testing and vulnerability research.
This is a pragmatic acknowledgment that blanket restrictions on security-related capabilities create problems for the security community. A model that refuses to discuss vulnerabilities is less useful for defensive security research, which arguably makes the ecosystem less safe, not more. The verification program creates an authenticated channel for legitimate security work without opening the model to unrestricted use.
What this means for model selection
Opus 4.7 is the strongest option for agentic workloads, where reliability over multi-step processes matters more than raw speed or cost per token. The model's ability to verify its own outputs is a genuine differentiator for use cases where it operates with limited human oversight, including code generation, data analysis pipelines, and automated research workflows.
For latency-sensitive or high-throughput applications where cost per token is the primary concern, Claude Sonnet 4.6 or Haiku 4.5 remain better choices. Opus 4.7 is not a speed model, and the output verification and extended reasoning behaviors add latency that does not benefit simple tasks.
For teams evaluating across providers, the competitive dynamics have clarified: the frontier models from Anthropic, OpenAI, and Google are functionally equivalent on knowledge benchmarks, and the meaningful differentiation is now on agentic reliability, tool use, and domain-specific performance. The right choice depends less on "which model is best" and more on which capability dimension matters most for your specific application.
You can compare Opus 4.7 against other frontier models on our model comparison tool or explore the full catalog at inferbase.ai/models.