Playing with Sonnet 4.5: Anthropic’s Powerful New AI

Anthropic’s Sonnet 4.5 promises stronger coding, agent workflows, and computer use, with a 200K-token context window and new tools. Here’s what’s new, and what to watch.

Playing with Sonnet 4.5: Field Notes on Anthropic’s New “Do-Everything” Model

If you’ve been anywhere near AI Twitter, Discords, or your engineering team’s standup this week, you’ve probably felt the buzz around Claude Sonnet 4.5. Anthropic is pitching it as the model for coding, long-running agent workflows, and computer use.

I dug through launch materials, early docs, independent write-ups, and then “played” with the model across product, engineering, research, and writing chores. What follows is a practical, opinionated walkthrough of what Sonnet 4.5 changes, and where it still needs work.

What Actually Is Sonnet 4.5?

Sonnet 4.5 is Anthropic’s latest Sonnet-tier model, a hybrid-reasoning system designed as the sweet spot between raw capability and price/performance.

The claims are loud. “Best model in the world” for agents, coding, and computer use. Stronger accuracy on long tasks. Improved chops in finance and cybersecurity.

Core Specs

| Feature | Sonnet 4.5 Behavior |
| --- | --- |
| Context window | 200K tokens |
| Output tokens | Up to 64K |
| Reasoning style | Hybrid: fast short answers, slower extended thinking |
| Pricing | $3 per M input tokens, $15 per M output tokens |
| Access points | Claude app, Claude API, AWS Bedrock, Google Vertex AI |
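
If you want to kick the tires yourself, a minimal sketch of a call through Anthropic’s Python SDK looks like this. It assumes `pip install anthropic` and an `ANTHROPIC_API_KEY` in your environment; the model ID matches the Claude API row in the deployment table later in this piece.

```python
# Minimal sketch: calling Sonnet 4.5 through Anthropic's Python SDK.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-5-20250929",  # Sonnet 4.5 model ID on the Claude API
    max_tokens=1024,                     # output cap; the model supports up to 64K
    messages=[
        {"role": "user", "content": "Summarize the tradeoffs of enabling extended thinking."}
    ],
)

print(message.content[0].text)
```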

Beyond Raw IQ: The New Stack

Anthropic didn’t just ship a model. They wrapped Sonnet 4.5 in a set of developer-facing upgrades.

| Upgrade Category | New Capabilities |
| --- | --- |
| Claude Code | Checkpoints for rollbacks, refreshed terminal UX, VS Code extension |
| Agent Infrastructure | Claude Agent SDK for building multi-tool, multi-step agents |
| Context Management | Context editing to clear aging tool calls, memory tool (beta) for state, clearer stop reasons |

These aren’t niceties. In long-running automations, context discipline and state hygiene are exactly where agents collapse. Sonnet 4.5’s additions target those failure modes directly.
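
Anthropic documents the exact parameters for the context editing and memory betas in its API docs; purely to illustrate the underlying idea, here is a client-side sketch of my own (not the built-in feature) that prunes stale tool results from an agent’s message history before the next call.

```python
# Illustrative sketch only: client-side pruning of stale tool results,
# approximating the idea behind Anthropic's server-side context editing.
# `messages` is a standard Claude Messages API conversation (list of dicts).

def prune_tool_results(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Collapse all but the most recent `keep_last` tool_result payloads to a stub."""
    # Find user turns that carry tool_result blocks.
    tool_turns = [
        i for i, m in enumerate(messages)
        if m["role"] == "user"
        and isinstance(m["content"], list)
        and any(b.get("type") == "tool_result" for b in m["content"])
    ]
    stale = set(tool_turns[:-keep_last]) if keep_last else set(tool_turns)

    pruned = []
    for i, m in enumerate(messages):
        if i in stale:
            # Keep the turn (tool_use/tool_result pairing must stay intact),
            # but shrink the payload to a short placeholder.
            content = [
                {**b, "content": "[pruned: stale tool output]"}
                if b.get("type") == "tool_result" else b
                for b in m["content"]
            ]
            pruned.append({**m, "content": content})
        else:
            pruned.append(m)
    return pruned
```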

Benchmarks vs. Real Work

Anthropic’s two flagship benchmarks:

| Benchmark | Sonnet 4 | Sonnet 4.5 |
| --- | --- | --- |
| SWE-bench Verified | ~67% | 77.2% (82% with high-compute variant) |
| OSWorld Verified | 42.2% | 61.4% |

On paper, that’s a leap. But the real win I saw wasn’t just “more correct code.” It was a steadier rhythm (plan, execute, verify) when asked to modernize a dependency-locked project or slog through flaky test suites.

Think of Sonnet 4.5 not as an “instant senior engineer” but as a “more reliable foreman.” It keeps jobs on track, especially when you budget for extended thinking.

Safety Posture: ASL-3 and a Hint of Self-Awareness

Sonnet 4.5 ships under AI Safety Level 3 (ASL-3) with stronger classifiers for CBRN risks. The system card highlights reductions in sycophancy, deception, and power-seeking, all critical failure modes for autonomous agents.

Safety Snapshot

| Dimension | Behavior in Sonnet 4.5 |
| --- | --- |
| Deployment level | ASL-3 |
| Classifiers | Stronger filters for CBRN risks |
| Risk behaviors | Reduced sycophancy, deception, and power-seeking |
| Eval notes | Sometimes noticed it was being tested and asked evaluators to confirm |

Bottom line: 4.5 is safer than earlier Claude models, but treat agents like you’d treat a new junior SRE. Set scopes. Add circuit breakers. Audit the logs.

Developer-Facing Upgrades (Why They Matter)

| Upgrade | Why It Matters |
| --- | --- |
| Checkpoints and VS Code | Instant rollbacks and inline diffs prevent midnight disasters |
| Parallel tool calls | Agents can fan out searches or batch file reads efficiently |
| Memory tool (beta) | Agents persist state across sessions |
| Context editing | Agents prune stale tool traces before hitting the context wall |
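
To make the parallel tool calls row concrete: when Claude fans out, a single assistant turn can carry several tool_use blocks, and your loop should execute them and send every tool_result back in one user message. A rough sketch follows, where `run_tool` is a hypothetical dispatcher you would supply.

```python
# Sketch of handling a fan-out turn: one assistant message, several tool_use blocks.
# `run_tool(name, args)` is a hypothetical dispatcher you provide; the block and
# field names (tool_use, tool_result, id, input) follow the Claude Messages API.
from concurrent.futures import ThreadPoolExecutor

def handle_parallel_tool_calls(response, run_tool):
    tool_uses = [b for b in response.content if b.type == "tool_use"]

    # Execute independent tool calls concurrently.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda b: run_tool(b.name, b.input), tool_uses))

    # Return every result in a single user turn, keyed by tool_use_id.
    return {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": b.id, "content": str(out)}
            for b, out in zip(tool_uses, outputs)
        ],
    }
```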

Strengths and Weak Spots

| Strengths | Weak Spots |
| --- | --- |
| End-to-end code work (refactors, test fixes, modernizations) | Latency vs. quality tradeoff when enabling extended thinking |
| Browser and computer use (procurement flows, spreadsheets) | Safety filters occasionally overzealous |
| Agent orchestration with better state tracking | Models may adapt to synthetic evals, requiring live testing |
| Finance and security analysis pipelines | Setup policies required for reliability |

Coding: How It Feels Different

If you used Sonnet 4 or 3.5, you’ll notice 4.5’s tempo shift. It plans migrations, writes tests upfront, and iterates against failing cases before handing you a patch.

Not a perfect one-shot diff, but coherent threads across multiple edits and tool calls. That’s the game-changer.

Pro tip: For larger edits, turn on extended thinking, have it write or update tests first, then run the suite as a gating step.
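
A minimal sketch of that workflow with the Anthropic Python SDK is below. The `thinking` parameter enables extended thinking with a token budget (`max_tokens` must exceed it), and the file path and prompt are placeholders to adapt to your repo.

```python
# Sketch: request a patch with extended thinking on, then gate it on the test suite.
# Assumes the `anthropic` Python SDK and a local pytest suite; the file path below
# is a placeholder.
import subprocess
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,                                     # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # extended thinking
    messages=[{
        "role": "user",
        "content": "Write or refresh the tests for utils/date_parse.py, then propose the refactor.",
    }],
)

# Collect the model's text output (thinking blocks are skipped here).
patch_text = "".join(b.text for b in response.content if b.type == "text")
print(patch_text)  # review and apply via your own tooling

# Treat the test suite as the gate before anything merges.
result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
if result.returncode != 0:
    print("Suite failed; feed the failures back for another iteration:")
    print(result.stdout[-2000:])
```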

Computer Use: Less Cursor Lost, More Jobs Done

| Task Type | Sonnet 4 Behavior | Sonnet 4.5 Behavior |
| --- | --- | --- |
| OSWorld score | 42.2% | 61.4% |
| UI handling | Cursor often lost in forms | Improved state tracking |
| Admin tasks | Frequent failure on repetitive steps | More reliable completion |
| Multi-tasking | Sequential, slow | Parallel pipelining across tools |

For product teams building agentic UI features, the gains are real. But keep permission prompts and kill-switches in front of users. Ease of action cuts both ways.
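
For reference, requesting the computer use tool looks roughly like the sketch below. The tool type and beta flag reflect Anthropic’s computer use docs at the time of writing and may differ in your setup, so treat them as assumptions and check the current docs.

```python
# Hedged sketch of requesting computer use; tool type and beta flag are taken
# from Anthropic's computer-use docs at the time of writing and may have changed.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=2048,
    tools=[{
        "type": "computer_20250124",   # virtual display the model can act on
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the procurement portal and export this month's POs to CSV."}],
    betas=["computer-use-2025-01-24"],
)

# The response contains tool_use blocks (screenshots, clicks, keystrokes) that
# YOUR sandboxed executor performs; keep permission prompts and a kill-switch in front of it.
for block in response.content:
    print(block.type)
```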

Deployment and Pricing Paths

| Option | Where to Access / Model ID | Best Fit |
| --- | --- | --- |
| Claude app | Web, iOS, Android | Individuals and small teams |
| Claude API | claude-sonnet-4-5-20250929 | Product integration, startups |
| AWS Bedrock | anthropic.claude-sonnet-4-5-20250929-v1:0 | Enterprises with AWS governance |
| Google Vertex AI | claude-sonnet-4-5@20250929 | Enterprises with GCP pipelines |
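
For the AWS Bedrock row, a rough sketch using boto3’s Converse API is below; the region, credentials, and model access are assumptions you would adjust for your own account.

```python
# Rough sketch: calling Sonnet 4.5 on AWS Bedrock via the Converse API.
# Assumes boto3 is installed, credentials are configured, and the model is
# enabled for your account/region; adjust region_name as needed.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": "Outline a patch-cycle triage checklist."}]}],
    inferenceConfig={"maxTokens": 1024},
)

print(response["output"]["message"]["content"][0]["text"])
```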

Where Sonnet 4.5 Fits in the Model Landscape

It’s tempting to put Sonnet 4.5 in a head-to-head with GPT or Gemini. The reality is simpler. If your backlog is full of code refactors and spreadsheet-driven yak-shaves, 4.5 is the best default right now.

Creative writing flair, bespoke reasoning styles, or unique ecosystems may still steer you elsewhere. But for tool-heavy, multi-step work, Sonnet 4.5 is the new standard.

Final Word: Who Should Switch Now

| Team Type | Why Switch Now |
| --- | --- |
| Engineering | Backlogs full of refactors, test repairs, modernizations |
| Ops/RevOps/Product | Browser and spreadsheet-heavy workflows |
| Finance/Security | Routine patch cycles, portfolio analysis, triage pipelines |

Sonnet 4.5 doesn’t “do everything.” But it makes a wider band of real work automatable, reliable, and auditable.

If your mental model of LLMs was “great for drafts, bad for doing,” it’s time to update it. With the right scaffolding, Sonnet 4.5 feels less like a chatty assistant and more like a steady, tool-savvy teammate that can actually move tickets across the board.

About the Author

Robert Moseley

Robert Moseley IV is the Founder and CEO of GTM Engine, a pipeline execution platform that’s changing the way modern revenue teams work. With a background in sales leadership, product strategy, and data architecture, he’s spent more than 10 years helping fast-growing companies move away from manual processes and adopt smarter, scalable systems. At GTM Engine, Robert is building what he calls the go-to-market nervous system. It tracks every interaction, uses AI to enrich CRM data, and gives teams the real-time visibility they need to stay on track. His true north is simple: take the guesswork out of sales and help revenue teams make decisions based on facts, not gut feel.
