Playing with Anthropic’s Sonnet 4.5

Anthropic’s Sonnet 4.5 promises stronger coding, agent workflows, and computer use with a 200K context window and new tools. Here’s what’s new and what to watch.

If you’ve been anywhere near AI Twitter, Discords, or your engineering team’s standup this week, you’ve probably felt the buzz around Claude Sonnet 4.5. Anthropic is pitching it as the model for coding, long-running agent workflows, and computer use. I spent time digging through the launch materials, early docs, and independent write-ups, then “played” with the model across typical product, engineering, research, and writing chores. Here’s a practical, opinionated walkthrough of what Sonnet 4.5 changes and where it still needs work.

What actually is Sonnet 4.5?

Sonnet 4.5 is Anthropic’s latest Sonnet-tier model. It's a hybrid-reasoning system positioned as the sweet spot between raw capability and price/performance. The headline claims are bold: “best model in the world” for agents, coding, and computer use, with stronger accuracy on long tasks and improved domain chops in finance and cybersecurity. It ships with a 200K context window and supports up to 64K output tokens, which matters for dense plans, long diffs, and big code edits. Pricing remains $3 per million input tokens and $15 per million output tokens. It’s live in the Claude app, via API, and on AWS Bedrock and Google Cloud Vertex AI.

Under the hood, 4.5 inherits Sonnet 4’s hybrid-reasoning approach (“short answers when you need speed, extended thinking when you need rigor”). Anthropic emphasizes that you can now choose how long the model “thinks,” trading latency and cost for better multi-step reasoning when necessary.
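Here’s a minimal sketch of what that choice looks like with the official `anthropic` Python SDK. The model ID is the launch identifier; the `max_tokens` and `budget_tokens` values are arbitrary examples, not recommendations (the thinking budget must stay below the output cap).

```python
# Sketch: calling Sonnet 4.5 with optional extended thinking via the
# `anthropic` Python SDK. Values below are illustrative, not tuned.

def build_request(prompt: str, think: bool = False, budget: int = 8000) -> dict:
    """Assemble kwargs for client.messages.create()."""
    kwargs = {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 16000,  # must exceed the thinking budget; up to 64K
        "messages": [{"role": "user", "content": prompt}],
    }
    if think:
        # Extended thinking: trade latency/cost for multi-step rigor.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return kwargs

# Usage (requires ANTHROPIC_API_KEY in the environment):
# import anthropic
# client = anthropic.Anthropic()
# resp = client.messages.create(**build_request("Plan this refactor...", think=True))
```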

What’s new beyond raw IQ?

Anthropic rolled out a stack around Sonnet 4.5:

• Claude Code upgrades: checkpoints for instant rollbacks, refreshed terminal UX, and a native VS Code extension.

• Agent infrastructure: a Claude Agent SDK (the same scaffolding Anthropic uses internally) now available for building your own multi-tool, multi-step agents.

• Context management features: “context editing” to automatically clear aging tool calls, a memory tool (beta) for state across sessions, and clearer “stop reasons”.

These aren’t just niceties: in long-running automations, context discipline and state hygiene are where most agents actually fall apart. Sonnet 4.5’s additions are aimed directly at those failure modes.
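A rough sketch of opting into these features over the API follows. The beta header and the `clear_tool_uses_20250919` / `memory_20250818` type strings are launch-era values from Anthropic’s docs; treat them as assumptions and verify against the current documentation before shipping.

```python
# Sketch: enabling context editing and the memory tool (both beta at
# launch). Type strings and the beta header may have changed -- verify.

def build_agent_request(messages: list) -> dict:
    """Assemble kwargs for client.beta.messages.create()."""
    return {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 8192,
        "messages": messages,
        # Context editing: automatically clear aging tool-call results
        # instead of letting them crowd out the 200K window.
        "context_management": {
            "edits": [{"type": "clear_tool_uses_20250919"}],
        },
        # Memory tool: the model issues read/write commands against a
        # memory directory that YOUR code persists between sessions.
        "tools": [{"type": "memory_20250818", "name": "memory"}],
        "extra_headers": {"anthropic-beta": "context-management-2025-06-27"},
    }
```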

Benchmarks and the “real work” gap

• SWE-bench Verified (real GitHub issues & tests): Sonnet 4.5 posts 77.2% under a simple two-tool scaffold. A high-compute variant with parallel attempts and candidate selection reaches 82.0%.

• OSWorld-Verified (computer use): Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%.

The best way to think about Sonnet 4.5’s benchmark wins is not “instant senior engineer” but “more reliable foreman” for complex, tool-assisted jobs, especially when you let it spend a bit more “thinking” budget.

Availability, pricing, and where to run it

Sonnet 4.5 is broadly available:

• Claude.ai (web, iOS, Android) for chat; Claude Code for agentic coding.

• Claude API (claude-sonnet-4-5-20250929) with the new context tools.

• AWS Bedrock (anthropic.claude-sonnet-4-5-20250929-v1:0) and Google Cloud Vertex AI (claude-sonnet-4-5@20250929) for enterprise pipelines.

Price: $3 / $15 per million input/output tokens (prompt caching and batch options can trim cost).
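At list price, per-call cost is simple arithmetic. A small estimator (the token counts in the comment are an illustrative example; caching and batch discounts are not modeled):

```python
# Quick cost estimator at Sonnet 4.5 list price: $3 / $15 per million
# input/output tokens. Prompt caching and batch discounts not modeled.

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return USD cost for one call at standard pricing."""
    return input_tokens / 1e6 * 3.00 + output_tokens / 1e6 * 15.00

# e.g. a 150K-token context producing a 10K-token diff:
# estimate_cost(150_000, 10_000) -> 0.45 + 0.15 = 0.60 USD
```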

Safety posture: ASL-3 and “situational awareness”

Sonnet 4.5 is released under AI Safety Level 3 (ASL-3) with stronger classifiers for CBRN (chemical/biological/radiological/nuclear) risks. The system card also reports reduced rates of misaligned behaviors such as sycophancy, deception, and power-seeking.

One surprising result: in external safety evaluations, Sonnet 4.5 sometimes seemed to notice it was being tested and asked evaluators to be transparent about that. Anthropic interprets this as a call for more realistic, less “contrived” evals.

Bottom line: 4.5 looks safer than prior Claude models, but the lab-vs-field gap in agent safety is real. Treat long-running automations like you would a new junior SRE: set scopes, add circuit breakers, and audit logs.

The developer-facing upgrades (and why they matter)

• Checkpoints, VS Code, and better tool calls: inline diffs, tighter edit loops, parallel tool calls, improved token tracking, and clearer stop signals.

• Memory and context editing: persist state beyond a single session and clear stale tool traces before hitting token limits.

Strengths vs. pain points

Where Sonnet 4.5 shines

• End-to-end code work: planning, refactors, test fixes, dependency modernizations.

• Browser/computer use: site navigation, procurement, spreadsheets.

• Agent orchestration: steadier progress, better tool selection, improved state tracking.

• Finance/security: competent in multi-step pipelines and patching workflows.

Where to be cautious

• Extended thinking costs tokens and time.

• Safety filters may occasionally over-block benign content.

• Evaluation awareness suggests teams should rely on live evals.

Setup paths

Scenario | Best path | Why it’s a good fit
Solo dev / writer | Claude app + Claude Code | Easy agent setup + file creation
Startup with cloud mix | Claude API + Agent SDK | Full control; context editing + memory
Enterprise with governance | AWS Bedrock or GCP Vertex AI | IAM, audit, guardrails

Coding: how different does it feel?

4.5 is more likely to outline a plan, write tests up front, and iterate against failing cases before producing a patch. On tough refactors, it maintains coherence across multiple edits and tool calls.

Pro tip: Turn on extended thinking, generate/update tests first, and run the suite before applying patches.

Computer use: beyond “click this, type that”

With a leap from 42.2% → 61.4% on OSWorld, Sonnet 4.5 is steadier in UI navigation and repetitive workflows. For product teams, improved state tracking and parallel tool calls enable pipelined work.

Safety & agent reality

• Scope narrow, expand later

• Record decisions with progress summaries

• Fallbacks: route to Sonnet 4 or humans when blocked

• Live evals: measure performance in messy, real environments
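The fallback bullet can be wired up as a simple model chain. Here `call_model` is a hypothetical wrapper around your API client, and the Sonnet 4 ID is the launch-era identifier; treat both as assumptions.

```python
# Sketch: try Sonnet 4.5 first, downgrade to Sonnet 4, and escalate to
# a human if every tier fails.

FALLBACK_CHAIN = ["claude-sonnet-4-5-20250929", "claude-sonnet-4-20250514"]

def run_with_fallback(task, call_model):
    """call_model(model_id, task) is your (hypothetical) API wrapper."""
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, task)
        except Exception:
            continue  # blocked, refused, or errored: try the next tier
    raise RuntimeError(f"all models failed; escalate to a human: {task!r}")
```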

Feature snapshot

Capability | Sonnet 4.5 behavior
Context window | 200K tokens
Output tokens | Up to 64K
Extended thinking | Optional; boosts complex reasoning
Tool use | Parallel calls; formatting preserved
Memory tool (beta) | Persists state across sessions
Context editing | Clears stale traces
Computer use | 61.4% on OSWorld-Verified
Coding | 77.2–82.0% on SWE-bench Verified
Safety level | ASL-3

Pricing & deployment table

Option | Access point | Use case | Notes
Claude app | Web, iOS, Android | Small teams | File creation + execution
Claude API | claude-sonnet-4-5-20250929 | Integration | $3/$15 per M tokens
AWS Bedrock | anthropic.claude-sonnet-4-5-20250929-v1:0 | Enterprise | IAM + governance
Google Vertex AI | claude-sonnet-4-5@20250929 | Enterprise | Model Garden / Marketplace

Migration

• From Sonnet 4: swap the model name and adopt the new context tools. Note that you can no longer set temperature and top_p in the same request.

• From 3.x: stability gains in tool-heavy jobs. Start small, measure cost/latency, expand.

Hands-on patterns that worked

• Test-first edits

• Plan → checklist → execute

• Context “gardening”

• Progress receipts

Where Sonnet 4.5 fits

Sonnet 4.5 is the default choice for coding, computer use, and agentic jobs where parallel tools and long-horizon stability matter. Creative writing or niche ecosystems may push you elsewhere, but for repetitive, structured work, 4.5 is hard to beat.

The safety conversation we should actually have

• Anthropic’s alignment work: measurable improvements

• Reality: distribution shifts in messy environments remain a risk

Keep scopes tight, log actions, and prefer “ask to act” over full autonomy.

Mini-playbook: rolling out a Sonnet 4.5 coding agent

• Seed prompts: write/update tests first; parallelize repo search + advisory lookup.

• Guardrails: directory allowlist; CI must pass; hard cap edits.

• Context policies: edit at 150K tokens, summarize to memory hourly.

• Human-in-the-loop: mandatory review on sensitive changes.

• Fallbacks: downgrade to Sonnet 4 or escalate to humans.
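Two of those guardrails, the directory allowlist and the 150K-token context policy, are easy to enforce in plain Python. The allowlisted paths here are examples; substitute your own repo layout.

```python
from pathlib import Path

# Sketch of two playbook guardrails: a directory allowlist for file
# edits and the 150K-token trigger for context editing.

ALLOWED_DIRS = [Path("src"), Path("tests")]   # example allowlist
CONTEXT_EDIT_THRESHOLD = 150_000              # tokens, per the playbook

def edit_allowed(path: str) -> bool:
    """Reject edits outside the allowlist (resolve() also defeats ../)."""
    target = Path(path).resolve()
    return any(target.is_relative_to(d.resolve()) for d in ALLOWED_DIRS)

def should_edit_context(total_tokens: int) -> bool:
    """Time to clear stale tool traces / summarize to memory?"""
    return total_tokens >= CONTEXT_EDIT_THRESHOLD
```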

FAQ

Is it the best coding model?

Its SWE-bench Verified scores (77–82%) make it a strong default for coding, but “best” still depends on your stack and tooling.

How long can it run?

Anthropic reports 30+ hours of autonomous operation on complex tasks; supervision is still recommended.

Does it hallucinate less?

Improved alignment, but not flawless. Always verify outputs.

Cost control?

Stick to standard mode, use extended thinking selectively, and lean on caching/batching.

The short version

• It’s a leap for coding, computer use, and long-running agents.

• Adopt the stack (checkpoints, context editing, memory, SDK), not just the model.

• Respect the safety envelope and keep humans in the loop.

Who should switch now?

• Engineering teams: backlogs of refactors, tests, and modernizations.

• Ops/Product: spreadsheet-heavy, browser workflows.

• Finance/Security: triage, patch cycles, clean audit workflows.

Sonnet 4.5 doesn’t “do everything” magically, but it makes more real work automatable, reliable, and auditable. With the right scaffolding, it feels less like a chatty assistant and more like a steady, tool-savvy teammate that can actually move tickets across the board.

About the Author

Robert Moseley

Robert Moseley IV is the Founder and CEO of GTM Engine, a pipeline execution platform that’s changing the way modern revenue teams work. With a background in sales leadership, product strategy, and data architecture, he’s spent more than 10 years helping fast-growing companies move away from manual processes and adopt smarter, scalable systems. At GTM Engine, Robert is building what he calls the go-to-market nervous system. It tracks every interaction, uses AI to enrich CRM data, and gives teams the real-time visibility they need to stay on track. His true north is simple: take the guesswork out of sales and help revenue teams make decisions based on facts, not gut feel.
