Playing with Anthropic’s Sonnet 4.5

Anthropic’s Sonnet 4.5 promises stronger coding, agent workflows, and computer use with a 200K context window and new tools. Here’s what’s new and what to watch.

If you’ve been anywhere near AI Twitter, Discords, or your engineering team’s standup this week, you’ve probably felt the buzz around Claude Sonnet 4.5. Anthropic is pitching it as the model for coding, long-running agent workflows, and computer use. I spent time digging through the launch materials, early docs, and independent write-ups, then “played” with the model across typical product, engineering, research, and writing chores. Here’s a practical, opinionated walkthrough of what Sonnet 4.5 changes and where it still needs work.

What actually is Sonnet 4.5?

Sonnet 4.5 is Anthropic’s latest Sonnet-tier model. It's a hybrid-reasoning system positioned as the sweet spot between raw capability and price/performance. The headline claims are bold: “best model in the world” for agents, coding, and computer use, with stronger accuracy on long tasks and improved domain chops in finance and cybersecurity. It ships with a 200K context window and supports up to 64K output tokens, which matters for dense plans, long diffs, and big code edits. Pricing remains $3 per million input tokens and $15 per million output tokens. It’s live in the Claude app, via API, and on AWS Bedrock and Google Cloud Vertex AI.

Under the hood, 4.5 inherits Sonnet 4’s hybrid-reasoning approach (“short answers when you need speed, extended thinking when you need rigor”). Anthropic emphasizes that you can now choose how long the model “thinks,” trading latency and cost for better multi-step reasoning when necessary.
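Here’s a minimal sketch of what that choice looks like with the official `anthropic` Python SDK. The model ID is the launch identifier; the `max_tokens` and `budget_tokens` values are arbitrary examples, not recommendations (the thinking budget must stay below the output cap).

```python
# Sketch: calling Sonnet 4.5 with optional extended thinking via the
# `anthropic` Python SDK. Values below are illustrative, not tuned.

def build_request(prompt: str, think: bool = False, budget: int = 8000) -> dict:
    """Assemble kwargs for client.messages.create()."""
    kwargs = {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 16000,  # must exceed the thinking budget; up to 64K
        "messages": [{"role": "user", "content": prompt}],
    }
    if think:
        # Extended thinking: trade latency/cost for multi-step rigor.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return kwargs

# Usage (requires ANTHROPIC_API_KEY in the environment):
# import anthropic
# client = anthropic.Anthropic()
# resp = client.messages.create(**build_request("Plan this refactor...", think=True))
```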

What’s new beyond raw IQ?

Anthropic rolled out a stack around Sonnet 4.5:

• Claude Code upgrades: checkpoints for instant rollbacks, refreshed terminal UX, and a native VS Code extension.

• Agent infrastructure: a Claude Agent SDK (the same scaffolding Anthropic uses internally) now available for building your own multi-tool, multi-step agents.

• Context management features: “context editing” to automatically clear aging tool calls, a memory tool (beta) for state across sessions, and clearer “stop reasons”.

These aren’t just niceties: in long-running automations, context discipline and state hygiene are where most agents actually fall apart. Sonnet 4.5’s additions are aimed directly at those failure modes.
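A rough sketch of opting into these features over the API follows. The beta header and the `clear_tool_uses_20250919` / `memory_20250818` type strings are launch-era values from Anthropic’s docs; treat them as assumptions and verify against the current documentation before shipping.

```python
# Sketch: enabling context editing and the memory tool (both beta at
# launch). Type strings and the beta header may have changed -- verify.

def build_agent_request(messages: list) -> dict:
    """Assemble kwargs for client.beta.messages.create()."""
    return {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 8192,
        "messages": messages,
        # Context editing: automatically clear aging tool-call results
        # instead of letting them crowd out the 200K window.
        "context_management": {
            "edits": [{"type": "clear_tool_uses_20250919"}],
        },
        # Memory tool: the model issues read/write commands against a
        # memory directory that YOUR code persists between sessions.
        "tools": [{"type": "memory_20250818", "name": "memory"}],
        "extra_headers": {"anthropic-beta": "context-management-2025-06-27"},
    }
```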

Benchmarks and the “real work” gap

• SWE-bench Verified (real GitHub issues & tests): Sonnet 4.5 posts 77.2% under a simple two-tool scaffold. A high-compute variant with parallel attempts and candidate selection reaches 82.0%.

• OSWorld-Verified (computer use): Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%.

The best way to think about Sonnet 4.5’s benchmark wins is not “instant senior engineer” but “more reliable foreman” for complex, tool-assisted jobs, especially when you let it spend a bit more “thinking” budget.

Availability, pricing, and where to run it

Sonnet 4.5 is broadly available:

• Claude.ai (web, iOS, Android) for chat; Claude Code for agentic coding.

• Claude API (claude-sonnet-4-5-20250929) with the new context tools.

• AWS Bedrock (anthropic.claude-sonnet-4-5-20250929-v1:0) and Google Cloud Vertex AI (claude-sonnet-4-5@20250929) for enterprise pipelines.

Price: $3 / $15 per million input/output tokens (prompt caching and batch options can trim cost).
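At list price, per-call cost is simple arithmetic. A small estimator (the token counts in the comment are an illustrative example; caching and batch discounts are not modeled):

```python
# Quick cost estimator at Sonnet 4.5 list price: $3 / $15 per million
# input/output tokens. Prompt caching and batch discounts not modeled.

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return USD cost for one call at standard pricing."""
    return input_tokens / 1e6 * 3.00 + output_tokens / 1e6 * 15.00

# e.g. a 150K-token context producing a 10K-token diff:
# estimate_cost(150_000, 10_000) -> 0.45 + 0.15 = 0.60 USD
```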

Safety posture: ASL-3 and “situational awareness”

Sonnet 4.5 is released under AI Safety Level 3 (ASL-3) with stronger classifiers for CBRN (chemical/biological/radiological/nuclear) risks. The system card also reports reduced rates of misaligned behaviors such as sycophancy, deception, and power-seeking.

One surprising result: in external safety evaluations, Sonnet 4.5 sometimes seemed to notice it was being tested and asked evaluators to be transparent about that. Anthropic interprets this as a call for more realistic, less “contrived” evals.

Bottom line: 4.5 looks safer than prior Claude models, but the lab-vs-field gap in agent safety is real. Treat long-running automations like you would a new junior SRE: set scopes, add circuit breakers, and audit logs.

The developer-facing upgrades (and why they matter)

• Checkpoints, VS Code, and better tool calls: inline diffs, tighter edit loops, parallel tool calls, improved token tracking, and clearer stop signals.

• Memory and context editing: persist state beyond a single session and clear stale tool traces before hitting token limits.

Strengths vs. pain points

Where Sonnet 4.5 shines

• End-to-end code work: planning, refactors, test fixes, dependency modernizations.

• Browser/computer use: site navigation, procurement, spreadsheets.

• Agent orchestration: steadier progress, better tool selection, improved state tracking.

• Finance/security: competent in multi-step pipelines and patching workflows.

Where to be cautious

• Extended thinking costs tokens and time.

• Safety filters may occasionally over-block benign content.

• Evaluation awareness suggests teams should rely on live evals.

Setup paths

Scenario | Best path | Why it’s a good fit
Solo dev / writer | Claude app + Claude Code | Easy agent setup + file creation
Startup with cloud mix | Claude API + Agent SDK | Full control; context editing + memory
Enterprise with governance | AWS Bedrock or GCP Vertex AI | IAM, audit, guardrails

Coding: how different does it feel?

4.5 is more likely to outline a plan, write tests up front, and iterate against failing cases before producing a patch. On tough refactors, it maintains coherence across multiple edits and tool calls.

Pro tip: Turn on extended thinking, generate/update tests first, and run the suite before applying patches.

Computer use: beyond “click this, type that”

With a leap from 42.2% → 61.4% on OSWorld, Sonnet 4.5 is steadier in UI navigation and repetitive workflows. For product teams, improved state tracking and parallel tool calls enable pipelined work.

Safety & agent reality

• Scope narrow, expand later

• Record decisions with progress summaries

• Fallbacks: route to Sonnet 4 or humans when blocked

• Live evals: measure performance in messy, real environments
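The fallback bullet can be wired up as a simple model chain. Here `call_model` is a hypothetical wrapper around your API client, and the Sonnet 4 ID is the launch-era identifier; treat both as assumptions.

```python
# Sketch: try Sonnet 4.5 first, downgrade to Sonnet 4, and escalate to
# a human if every tier fails.

FALLBACK_CHAIN = ["claude-sonnet-4-5-20250929", "claude-sonnet-4-20250514"]

def run_with_fallback(task, call_model):
    """call_model(model_id, task) is your (hypothetical) API wrapper."""
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, task)
        except Exception:
            continue  # blocked, refused, or errored: try the next tier
    raise RuntimeError(f"all models failed; escalate to a human: {task!r}")
```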

Feature snapshot

Capability | Sonnet 4.5 behavior
Context window | 200K tokens
Output tokens | Up to 64K
Extended thinking | Optional; boosts complex reasoning
Tool use | Parallel calls; formatting preserved
Memory tool (beta) | Persists state across sessions
Context editing | Clears stale traces
Computer use | 61.4% on OSWorld-Verified
Coding | 77.2–82.0% on SWE-bench Verified
Safety level | ASL-3

Pricing & deployment table

Option | Access point | Use case | Notes
Claude app | Web, iOS, Android | Small teams | File creation + execution
Claude API | claude-sonnet-4-5-20250929 | Integration | $3/$15 per M tokens
AWS Bedrock | anthropic.claude-sonnet-4-5-20250929-v1:0 | Enterprise | IAM + governance
Google Vertex AI | claude-sonnet-4-5@20250929 | Enterprise | Model Garden / Marketplace

Migration

• From Sonnet 4: swap the model name and adopt the new context tools. Note that you can no longer set temperature and top_p in the same request.

• From 3.x: stability gains in tool-heavy jobs. Start small, measure cost/latency, expand.

Hands-on patterns that worked

• Test-first edits

• Plan → checklist → execute

• Context “gardening”

• Progress receipts

Where Sonnet 4.5 fits

Sonnet 4.5 is the default choice for coding, computer use, and agentic jobs where parallel tools and long-horizon stability matter. Creative writing or niche ecosystems may push you elsewhere, but for repetitive, structured work, 4.5 is hard to beat.

The safety conversation we should actually have

• Anthropic’s alignment work: measurable improvements

• Reality: distribution shifts in messy environments remain a risk

Keep scopes tight, log actions, and prefer “ask to act” over full autonomy.

Mini-playbook: rolling out a Sonnet 4.5 coding agent

• Seed prompts: write/update tests first; parallelize repo search + advisory lookup.

• Guardrails: directory allowlist; CI must pass; hard cap edits.

• Context policies: edit at 150K tokens, summarize to memory hourly.

• Human-in-the-loop: mandatory review on sensitive changes.

• Fallbacks: downgrade to Sonnet 4 or escalate to humans.
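Two of those guardrails, the directory allowlist and the 150K-token context policy, are easy to enforce in plain Python. The allowlisted paths here are examples; substitute your own repo layout.

```python
from pathlib import Path

# Sketch of two playbook guardrails: a directory allowlist for file
# edits and the 150K-token trigger for context editing.

ALLOWED_DIRS = [Path("src"), Path("tests")]   # example allowlist
CONTEXT_EDIT_THRESHOLD = 150_000              # tokens, per the playbook

def edit_allowed(path: str) -> bool:
    """Reject edits outside the allowlist (resolve() also defeats ../)."""
    target = Path(path).resolve()
    return any(target.is_relative_to(d.resolve()) for d in ALLOWED_DIRS)

def should_edit_context(total_tokens: int) -> bool:
    """Time to clear stale tool traces / summarize to memory?"""
    return total_tokens >= CONTEXT_EDIT_THRESHOLD
```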

FAQ

Is it the best coding model?

Its SWE-bench Verified scores (77–82%) make it a strong default for coding, but “best” still depends on your stack and tooling.

How long can it run?

Anthropic reports 30+ hours of autonomous operation on complex tasks; supervision is still recommended.

Does it hallucinate less?

Improved alignment, but not flawless. Always verify outputs.

Cost control?

Stick to standard mode, use extended thinking selectively, and lean on caching/batching.

The short version

• It’s a leap for coding, computer use, and long-running agents.

• Adopt the stack (checkpoints, context editing, memory, SDK), not just the model.

• Respect the safety envelope and keep humans in the loop.

Who should switch now?

• Engineering teams: backlogs of refactors, tests, and modernizations.

• Ops/Product: spreadsheet-heavy, browser workflows.

• Finance/Security: triage, patch cycles, clean audit workflows.

Sonnet 4.5 doesn’t “do everything” magically, but it makes more real work automatable, reliable, and auditable. With the right scaffolding, it feels less like a chatty assistant and more like a steady, tool-savvy teammate that can actually move tickets across the board.

About the Author

Robert Moseley

Robert Moseley IV is the Founder and CEO of GTM Engine, a pipeline execution platform that’s changing the way modern revenue teams work. With a background in sales leadership, product strategy, and data architecture, he’s spent more than 10 years helping fast-growing companies move away from manual processes and adopt smarter, scalable systems. At GTM Engine, Robert is building what he calls the go-to-market nervous system. It tracks every interaction, uses AI to enrich CRM data, and gives teams the real-time visibility they need to stay on track. His true north is simple: take the guesswork out of sales and help revenue teams make decisions based on facts, not gut feel.
