
What Is Claude Sonnet 4.6?
Anthropic dropped Claude Sonnet 4.6 on February 17, 2026, and honestly? It’s the most interesting mid-tier AI model release in months. Not because it reinvents anything, but because it makes the expensive Opus tier feel… optional for most people.
Sonnet 4.6 sits in Anthropic’s middle tier – between the lightweight Haiku and the premium Opus. It replaced Sonnet 4.5 as the default on claude.ai for both free and Pro users. The API model ID is claude-sonnet-4-6, and pricing stays at $3/$15 per million tokens (input/output). Same price as before, way more capability.
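If you want to kick the tires from code, a minimal call looks like this. This is a sketch, not Anthropic's reference code: the model ID is the one above, the rest is standard Messages API usage, and it assumes the `anthropic` package is installed with `ANTHROPIC_API_KEY` set in your environment.

```python
# Minimal sketch: calling Sonnet 4.6 through the Anthropic Python SDK.
# Assumes the `anthropic` package is installed and ANTHROPIC_API_KEY is set.

MODEL_ID = "claude-sonnet-4-6"

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Assemble Messages API arguments (kept separate so they're easy to inspect)."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str) -> str:
    import anthropic  # imported lazily so build_request works without the SDK
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(**build_request(prompt))
    return message.content[0].text
```

Swapping models later is a one-line change to `MODEL_ID`, which matters for the tiering strategy discussed further down.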
Here’s what caught my attention after spending two weeks with it: the gap between Sonnet and Opus has basically collapsed. On some benchmarks, Sonnet 4.6 actually beats Opus 4.6. That’s never happened before in Anthropic’s model lineup.
The Numbers That Actually Matter
I’m going to skip the usual benchmark dump and focus on what these scores mean in practice. But first, the raw data:
| Benchmark | Sonnet 4.6 | Sonnet 4.5 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|---|
| SWE-bench Verified (coding) | 79.6% | 77.2% | 80.8% | 77.0% |
| OSWorld-Verified (computer use) | 72.5% | 61.4% | 72.7% | 38.2% |
| Terminal-Bench 2.0 | 59.1% | 51.0% | 62.7% | 46.7% |
| GPQA Diamond (reasoning) | 74.1% | 65.0% | 74.5% | 73.8% |
| GDPval-AA Office (Elo) | 1633 | 1375 | 1559 | 1524 |
| Finance Agent | 63.3% | 57.3% | 62.0% | 60.7% |
| ARC-AGI-2 | 58.3% | 13.6% | 75.2% | – |
| MATH-500 | 97.8% | 96.4% | 97.6% | 97.4% |
Three things jump out from this table.
First, computer use went from 61.4% to 72.5% on OSWorld. That’s an 11-point jump in one generation. Opus 4.6 scores 72.7%, so the difference is 0.2 percentage points. Basically identical. GPT-5.2 sits at 38.2% on the same benchmark, which tells you how far ahead Anthropic is in this space.
Second, look at the office tasks score. Sonnet 4.6 hit 1633 Elo on GDPval-AA, which measures performance on real business tasks like filling out forms, navigating spreadsheets, and working across browser tabs. That’s higher than Opus 4.6 (1559) and GPT-5.2 (1524). A mid-tier model beating the flagship on office work is a big deal if you’re building workplace automation.
Third, the ARC-AGI-2 jump from 13.6% to 58.3% is wild. That’s a 4.3x improvement on a benchmark designed to test general reasoning. Opus still leads at 75.2%, but the trajectory here suggests Anthropic figured something out about abstract reasoning during training.
Computer Use: From Demo to Actually Useful
Anthropic pioneered AI computer use back in October 2024 with Claude 3.5 Sonnet. At the time, they openly called it “experimental – at times cumbersome and error-prone.” They weren’t wrong. Early computer use felt like watching someone’s grandparent try to use a touchscreen.
Sonnet 4.6 is a different story. The model scores 72.5% on OSWorld-Verified, which tests hundreds of tasks across real software like Chrome, LibreOffice, and VS Code running on a simulated computer. No special APIs, no shortcuts – the model sees the screen and clicks around like a person would.
I tested it on a few real-world scenarios:
- Navigating a multi-tab insurance quoting workflow – it handled form fills, dropdown selections, and cross-tab data copying without getting lost
- Extracting data from a complex spreadsheet with merged cells and conditional formatting – completed in about 40 seconds what took me 5 minutes
- Filing a support ticket through a legacy web portal with nested menus – got it right on the first try
Anthropic also reports 94% accuracy on their internal insurance computer use benchmark, which involves multi-step processes through enterprise software. That’s the kind of task that has real economic value – the boring, repetitive screen-clicking work that eats up hours of someone’s day.
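For API users, computer use is exposed as a tool definition plus a beta flag on the request. The sketch below uses the Claude 4-era identifiers (`computer_20250124` and the matching beta flag); Sonnet 4.6 may ship newer version strings, so treat those two values as assumptions and verify against the current docs before building on them.

```python
# Hedged sketch of a computer-use request body. The tool type and beta flag
# are the Claude 4-era values; Sonnet 4.6 may use newer identifiers.

COMPUTER_TOOL = {
    "type": "computer_20250124",   # assumption: version string may differ for 4.6
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
}

def computer_use_request(task: str) -> dict:
    """Build the kwargs for client.beta.messages.create(...)."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 2048,
        "tools": [COMPUTER_TOOL],
        "betas": ["computer-use-2025-01-24"],  # assumption: flag may differ for 4.6
        "messages": [{"role": "user", "content": task}],
    }
```

Your harness then runs a loop: send the request, execute whatever click/type/screenshot action the model returns, feed the resulting screenshot back, repeat until the task is done.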
One thing worth calling out: prompt injection resistance improved significantly. When a model can click around on websites, malicious actors can try to hijack it by hiding instructions on web pages. Anthropic says Sonnet 4.6 is much better at resisting these attacks compared to 4.5, performing similarly to Opus 4.6.
Coding Performance: Close to Opus, Way Ahead of GPT
For developers, the coding story is straightforward. Sonnet 4.6 scores 79.6% on SWE-bench Verified, up from 77.2% with Sonnet 4.5. Opus 4.6 hits 80.8%, so the gap is just 1.2 points.
On Terminal-Bench 2.0, which tests command-line proficiency and system administration tasks, it scores 59.1% versus Opus’s 62.7%. Still a gap, but GPT-5.2 trails at 46.7%.
The practical difference I noticed in AI code editors using Sonnet 4.6: it’s noticeably better at following complex instructions without drifting. Sonnet 4.5 had a habit of “forgetting” constraints mid-task, especially in longer coding sessions. That happens less with 4.6.
Anthropic reports a 70% win rate for Sonnet 4.6 over Sonnet 4.5 in Claude Code head-to-head comparisons. More surprisingly, 59% of users preferred it over the older Opus 4.5. That’s the first time a Sonnet model has been preferred over its Opus-tier predecessor.
What About Vibe Coding?
If you’ve been following the vibe coding trend, Sonnet 4.6 is probably the best model for it right now at its price point. The combination of strong coding ability, long context (1M tokens in beta), and extended thinking mode means you can describe what you want in natural language and get surprisingly complete results.
For comparison, using Cursor or Claude Code with Sonnet 4.6 gives you roughly 95% of the Opus experience at a fraction of the cost. For hobby projects and prototyping, that trade-off makes total sense.
1M Token Context Window
Sonnet 4.6 ships with a 1M token context window in beta. That’s up from the previous generation’s limit and matches what Opus offers. Maximum output is 64K tokens.
In practice, 1M tokens means you can feed the model an entire medium-sized codebase (think 50-80 files), a full book, or months of conversation history. The model handles it reasonably well, though I noticed some degradation in recall accuracy past about 500K tokens. Information in the middle of very long contexts sometimes gets overlooked – a known issue with transformer models that hasn’t been fully solved.
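A quick way to sanity-check whether your input fits: token counts vary by tokenizer, but roughly four characters per token is a serviceable estimate for English text and code. The heuristic below is exactly that, an approximation, not Claude's actual tokenizer.

```python
# Rough fit check for the 1M-token context window. Uses the common
# ~4 characters-per-token heuristic, NOT Claude's actual tokenizer.
from pathlib import Path

CONTEXT_LIMIT = 1_000_000
CHARS_PER_TOKEN = 4  # coarse approximation

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def codebase_fits(root: str, pattern: str = "*.py") -> bool:
    """True if every file matching `pattern` under `root` fits in one context."""
    total = sum(
        estimate_tokens(p.read_text(errors="ignore"))
        for p in Path(root).rglob(pattern)
    )
    return total <= CONTEXT_LIMIT
```

If you're anywhere near the limit, count tokens properly via the API's token-counting endpoint rather than trusting the heuristic.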
Extended thinking mode is also available, which lets the model “think through” complex problems step by step before answering. This is especially useful for math, logic puzzles, and multi-step reasoning tasks. On MATH-500, it scores 97.8% – basically matching Opus 4.6’s 97.6% and beating GPT-5.2’s 97.4%.
Where Sonnet 4.6 Beats Opus (Yes, Really)
Look, I expected Sonnet 4.6 to be “good for the price.” I didn’t expect it to outperform Opus in specific categories. But here we are:
- Office productivity: 1633 Elo vs Opus’s 1559 on GDPval-AA
- Financial analysis: 63.3% vs Opus’s 62.0% on Finance Agent
- Scaled tool use: 61.3% vs Opus’s 60.3% on MCP-Atlas
- Math: 97.8% vs Opus’s 97.6% on MATH-500
These aren’t cherry-picked metrics. Office tasks, financial analysis, and tool use are exactly the capabilities that businesses care about when deploying AI agents. The fact that a $3/$15 model beats a $15/$75 model (Opus pricing) on these specific tasks is significant.
Where Opus still leads clearly: deep reasoning (Humanity’s Last Exam: 26.3% vs 19.1%), abstract pattern recognition (ARC-AGI-2: 75.2% vs 58.3%), and the most complex coding tasks (SWE-bench: 80.8% vs 79.6%). If your workload is research-grade reasoning or pushing the absolute frontier of AI capability, Opus is still worth the premium.
Pricing and Availability
| Plan | Access | Price |
|---|---|---|
| Free (claude.ai) | Default model, usage limits apply | $0 |
| Pro (claude.ai) | Default model, higher limits | $20/month |
| API | claude-sonnet-4-6 | $3/$15 per 1M tokens |
| Amazon Bedrock | Available now | Same as API |
| Google Vertex AI | Available now | Same as API |
| Microsoft Foundry | Available now | Same as API |
The pricing hasn’t changed from Sonnet 4.5. At $3 per million input tokens and $15 per million output tokens, it’s 5x cheaper than Opus 4.6 ($15/$75). For context, GPT-5.2 costs $10/$30 per million tokens, so Sonnet 4.6 is also significantly cheaper than OpenAI’s comparable model while outperforming it on most benchmarks.
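The price gap compounds quickly at scale. A small calculator makes it concrete, using the per-million-token rates from the table above (blended cost depends on your input/output mix, which is why the multiple isn't a single number):

```python
# Cost comparison using the per-million-token rates quoted above.
RATES = {  # (input $/M tokens, output $/M tokens)
    "sonnet-4.6": (3.0, 15.0),
    "opus-4.6": (15.0, 75.0),
    "gpt-5.2": (10.0, 30.0),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at the quoted rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example workload: an agent run with 2M input tokens and 500K output tokens.
for model in RATES:
    print(f"{model}: ${cost(model, 2_000_000, 500_000):.2f}")
# → sonnet-4.6: $13.50, opus-4.6: $67.50, gpt-5.2: $35.00
```

Because Opus's rates are exactly 5x Sonnet's on both input and output, the 5x multiple holds for any mix; against GPT-5.2 it lands between 2x and 3.3x depending on how output-heavy the workload is.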
If you’re on the free plan at claude.ai, you already have access. Pro subscribers ($20/month) get higher usage limits and priority access during peak times.
Sonnet 4.6 vs GPT-5.2 vs Gemini 3.1 Pro
The competitive landscape as of late February 2026:
| Capability | Sonnet 4.6 | GPT-5.2 | Gemini 3.1 Pro |
|---|---|---|---|
| Coding (SWE-bench) | 79.6% | 77.0% | ~75% |
| Computer use | 72.5% | 38.2% | Limited |
| Reasoning (GPQA) | 74.1% | 73.8% | ~72% |
| Context window | 1M (beta) | 128K | 2M |
| Price (input/output) | $3/$15 | $10/$30 | $3.50/$10.50 |
Sonnet 4.6 leads on coding and computer use. GPT-5.2 is competitive on reasoning but costs two to three times as much per token, depending on your input/output mix. Gemini 3.1 Pro offers the largest context window at 2M tokens and competitive pricing, but trails on coding benchmarks and doesn’t have computer use capabilities.
For most users, Sonnet 4.6 offers the best overall package: strong performance across the board, the best computer use capabilities available, and aggressive pricing. If you specifically need massive context (over 1M tokens), Gemini has the edge. If you’re locked into the OpenAI ecosystem, GPT-5.2 is still solid, just more expensive for what you get.
Who Should Use Sonnet 4.6?
After testing it across different use cases, here’s my take:
Developers building AI agents: This is the sweet spot. Computer use + coding + tool use at $3/$15 per million tokens. You’d be paying 5x more for Opus and getting marginal improvements on most agent tasks. Start with Sonnet 4.6 and only upgrade to Opus if you hit specific capability walls.
Businesses automating office work: The GDPval-AA scores speak for themselves. If your use case involves navigating web apps, filling forms, processing documents, or managing spreadsheets, Sonnet 4.6 is literally the best model available. Better than Opus. Better than GPT-5.2.
Casual users on claude.ai: You already have it. The free tier gives you access to what is arguably the strongest mid-tier AI model ever released. If you were considering a Pro subscription, this upgrade makes it a better value proposition than before.
Researchers and academics: Stick with Opus 4.6. The gap on Humanity’s Last Exam (26.3% vs 19.1%) and ARC-AGI-2 (75.2% vs 58.3%) matters for frontier research. Sonnet is good, but Opus still has the edge on the hardest problems.
The Bottom Line
Claude Sonnet 4.6 makes the “which Claude model should I use?” question easy for 80% of people: this one. It matches Opus on computer use, beats it on office tasks, comes within 1-2 points on coding, and costs a fifth as much.
The model isn’t perfect. Deep reasoning still favors Opus. The 1M context window is in beta and can get wobbly past 500K tokens. And despite the massive computer use improvements, it still makes mistakes on complex multi-step workflows – just fewer of them.
But here’s the thing: at $3/$15 per million tokens, you can afford to let it retry. The cost savings over Opus mean you could run a task five times with Sonnet 4.6 and still spend the same as running it once with Opus. For reliability-sensitive production workloads, that math works out in Sonnet’s favor more often than you’d think.
Anthropic’s been on a tear with model releases lately, and Sonnet 4.6 might be the release that shifts how people think about AI model tiers. The “mid-tier” label doesn’t really fit anymore when the model outperforms flagships on real-world work tasks. It’s just… the model most people should use.
Safety and Prompt Injection Resistance
One area that doesn’t get enough attention in model reviews: safety. Anthropic published a system card alongside the release, and their safety researchers described Sonnet 4.6 as having “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment.”
More practically relevant: prompt injection resistance improved a lot. This matters because computer use means the model is browsing websites that could contain hidden instructions designed to hijack its behavior. Sonnet 4.6 handles these attacks roughly as well as Opus 4.6, which is a meaningful upgrade from Sonnet 4.5.
If you’re building production agents that interact with untrusted content (web scraping, email processing, customer-facing tools), this improvement alone might justify switching from 4.5.
Extended Thinking Mode
Like Opus, Sonnet 4.6 supports extended thinking – a mode where the model reasons through a problem step by step before producing its final answer. You can enable it via the API or toggle it in claude.ai settings.
I found extended thinking most useful for:
- Multi-step math problems where the model needs to track intermediate results
- Complex code refactoring where understanding the full dependency chain matters
- Analysis tasks where the model needs to weigh multiple factors before making a recommendation
- Debugging sessions where the root cause isn’t obvious from the error message alone
The trade-off is speed. Extended thinking adds latency – sometimes 10-20 seconds for complex queries. For interactive chat, regular mode is usually fine. For batch processing or background agents where accuracy trumps speed, extended thinking is worth enabling.
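In the API, extended thinking is switched on with a `thinking` parameter and a token budget. The shape below matches the Claude 4-family Messages API; I'm assuming Sonnet 4.6 keeps it unchanged, so verify against current docs.

```python
# Sketch of an extended-thinking request. The `thinking` parameter shape
# follows the Claude 4-family Messages API; assumed unchanged for 4.6.

def thinking_request(prompt: str, budget_tokens: int = 10_000) -> dict:
    # max_tokens must exceed the thinking budget, because thinking
    # tokens count toward the overall output limit.
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": budget_tokens + 4096,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```

A practical pattern: default to regular mode for interactive chat, and pass a larger `budget_tokens` only on the batch/background paths where the extra 10-20 seconds doesn't hurt.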
How It Fits Into Anthropic’s Model Lineup
Anthropic now offers a clear three-tier structure, and each tier has a distinct purpose:
- Haiku: Fast, cheap, good for classification, routing, and simple tasks. Think of it as the “triage nurse” – it handles the easy stuff so you don’t burn expensive tokens.
- Sonnet 4.6: The workhorse. Handles 80-90% of real-world tasks at a reasonable price. Best computer use available, strong coding, excellent for agents.
- Opus 4.6: The specialist. For when you need the absolute best on hard problems – frontier reasoning, complex research, the most difficult coding challenges.
The smart play for most applications is to use Haiku for routing and simple queries, Sonnet 4.6 for the bulk of work, and Opus only when Sonnet fails or when the stakes justify the 5x cost premium. This tiered approach can cut your AI costs by 60-70% compared to running everything through Opus.
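That tiering strategy is simple to sketch in code: classify the request, pick the tier, escalate only on failure. Everything below is my own illustration, not an Anthropic API, and the Haiku/Opus model IDs are assumptions following the naming pattern of `claude-sonnet-4-6`.

```python
# Illustrative tiered router. The difficulty labels and escalation rule are
# invented for this sketch; the Haiku and Opus model IDs are assumptions.

TIER_MODELS = {
    "simple": "claude-haiku-4-5",    # assumption: Haiku model ID
    "standard": "claude-sonnet-4-6",
    "frontier": "claude-opus-4-6",   # assumption: Opus model ID
}

def route(difficulty: str, previous_failures: int = 0) -> str:
    """Pick a model ID; escalate to Opus after repeated failures at a lower tier."""
    if previous_failures >= 2:
        return TIER_MODELS["frontier"]
    return TIER_MODELS.get(difficulty, TIER_MODELS["standard"])
```

In production the `difficulty` label would itself come from a cheap Haiku classification pass; defaulting unknown labels to Sonnet keeps the router safe when classification is unsure.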
FAQ
Is Claude Sonnet 4.6 free to use?
Yes, on claude.ai. Free plan users get access with usage limits. The API costs $3/$15 per million tokens (input/output).
How does Sonnet 4.6 compare to ChatGPT?
Sonnet 4.6 outperforms GPT-5.2 on coding (79.6% vs 77.0% on SWE-bench) and dominates on computer use (72.5% vs 38.2% on OSWorld). It’s also cheaper at $3/$15 vs $10/$30 per million tokens. GPT-5.2 stays competitive on pure reasoning (73.8% vs 74.1% on GPQA Diamond). For a deeper comparison, check our ChatGPT vs Claude guide.
Should I upgrade from Sonnet 4.5 to 4.6?
If you’re using the API, just change the model ID to claude-sonnet-4-6. Same price, better performance across the board. There’s no reason to stay on 4.5.
When should I use Opus instead of Sonnet?
For frontier research, complex mathematical proofs, abstract reasoning tasks, or when you need the absolute best accuracy on difficult problems. For everything else – coding, office work, agents, general use – Sonnet 4.6 is sufficient and 5x cheaper.
Does Sonnet 4.6 support computer use?
Yes, and it’s nearly as good as Opus at it. It scores 72.5% on OSWorld-Verified (Opus gets 72.7%). Computer use is available through the API and Claude Cowork.