Playing with Sonnet 4.5: Field Notes on Anthropic’s New “Do-Everything” Model
If you’ve been anywhere near AI Twitter, Discords, or your engineering team’s standup this week, you’ve probably felt the buzz around Claude Sonnet 4.5. Anthropic is pitching it as the model for coding, long-running agent workflows, and computer use.
I dug through launch materials, early docs, independent write-ups, and then “played” with the model across product, engineering, research, and writing chores. What follows is a practical, opinionated walkthrough of what Sonnet 4.5 changes, and where it still needs work.
What Actually Is Sonnet 4.5?
Sonnet 4.5 is Anthropic’s latest Sonnet-tier model, a hybrid-reasoning system designed as the sweet spot between raw capability and price/performance.
The claims are loud. “Best model in the world” for agents, coding, and computer use. Stronger accuracy on long tasks. Improved chops in finance and cybersecurity.
Core Specs
| Feature | Sonnet 4.5 Behavior |
| --- | --- |
| Context window | 200K tokens |
| Output tokens | Up to 64K |
| Reasoning style | Hybrid: fast short answers, slower extended thinking |
| Pricing | $3 per M input tokens, $15 per M output tokens |
| Access points | Claude app, Claude API, AWS Bedrock, Google Vertex AI |
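To make the pricing row concrete, here is a minimal cost estimator using the $3/$15 per-million-token rates from the table. The function name and structure are my own, not part of any Anthropic SDK:

```python
# Rough cost estimator for Sonnet 4.5 API usage, based on the published
# rates: $3 per million input tokens, $15 per million output tokens.
INPUT_RATE_PER_M = 3.00
OUTPUT_RATE_PER_M = 15.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a single request."""
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: a 50K-token prompt with an 8K-token response
# costs 0.15 + 0.12 dollars.
print(round(estimate_cost(50_000, 8_000), 3))
```

Output tokens dominate the bill at 5x the input rate, which matters once you enable extended thinking and responses get long.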
Beyond Raw IQ: The New Stack
Anthropic didn’t just ship a model. They wrapped Sonnet 4.5 in a set of developer-facing upgrades.
| Upgrade Category | New Capabilities |
| --- | --- |
| Claude Code | Checkpoints for rollbacks, refreshed terminal UX, VS Code extension |
| Agent Infrastructure | Claude Agent SDK for building multi-tool, multi-step agents |
| Context Management | Context editing to clear aging tool calls, memory tool (beta) for state, clearer stop reasons |
These aren’t niceties. In long-running automations, context discipline and state hygiene are exactly where agents collapse. Sonnet 4.5’s additions target those failure modes directly.
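To see the failure mode context editing targets, here is a client-side illustration of the same idea: dropping the oldest tool results once a transcript exceeds a token budget. Anthropic's feature does this server-side; every name and heuristic below is invented for the sketch:

```python
# Illustrative client-side version of "context editing": drop the oldest
# tool-result messages once the transcript exceeds a token budget.
# (Sonnet 4.5 handles this server-side; this only shows the idea.)

def approx_tokens(msg: dict) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(str(msg.get("content", ""))) // 4)

def prune_tool_results(messages: list[dict], budget: int) -> list[dict]:
    """Remove oldest tool results until the transcript fits the budget."""
    pruned = list(messages)
    while sum(approx_tokens(m) for m in pruned) > budget:
        # Find the oldest tool result; stop if none are left to drop.
        idx = next((i for i, m in enumerate(pruned) if m["role"] == "tool"), None)
        if idx is None:
            break
        pruned.pop(idx)
    return pruned
```

The point of the sketch: user intent and assistant decisions survive, while bulky, stale tool output is the first thing to go.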
Benchmarks vs. Real Work
Anthropic’s two flagship benchmarks:
| Benchmark | Sonnet 4 | Sonnet 4.5 |
| --- | --- | --- |
| SWE-bench Verified | ~67% | 77.2% (82% with high-compute variant) |
| OSWorld Verified | 42.2% | 61.4% |
On paper, that’s a leap. But the real win I saw wasn’t just “more correct code.” It was a steadier rhythm (plan, execute, verify) when asked to modernize a dependency-locked project or slog through flaky test suites.
Think of Sonnet 4.5 not as “instant senior engineer” but as “more reliable foreman.” It keeps jobs on track, especially when you budget for extended thinking.
Safety Posture: ASL-3 and a Hint of Self-Awareness
Sonnet 4.5 ships under AI Safety Level 3 (ASL-3) with stronger classifiers for CBRN risks. The system card highlights improvements in sycophancy, deception, and power-seeking, all critical for autonomous agents.
Safety Snapshot
| Dimension | Behavior in Sonnet 4.5 |
| --- | --- |
| Deployment level | ASL-3 |
| Classifiers | Stronger filters for CBRN risks |
| Risk behaviors | Reduced sycophancy, deception, and power-seeking |
| Eval notes | Sometimes noticed it was being tested and asked evaluators to confirm |
Bottom line: 4.5 is safer than earlier Claude models, but treat agents like you’d treat a new junior SRE. Set scopes. Add circuit breakers. Audit the logs.
Developer-Facing Upgrades (Why They Matter)
| Upgrade | Why It Matters |
| --- | --- |
| Checkpoints and VS Code | Instant rollbacks and inline diffs prevent midnight disasters |
| Parallel tool calls | Agents can fan out searches or batch file reads efficiently |
| Memory tool (beta) | Agents persist state across sessions |
| Context editing | Agents prune stale tool traces before hitting the context wall |
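The "fan out" row is the easiest to picture in code. Here is a small sketch of concurrent tool calls with `asyncio.gather`; `fetch_doc` is a made-up stand-in for a real search or file-read tool:

```python
import asyncio

# Illustrative fan-out: issue several independent "tool calls" at once,
# the way an agent can batch them, instead of awaiting each in turn.

async def fetch_doc(name: str) -> str:
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"contents of {name}"

async def fan_out(names: list[str]) -> list[str]:
    # gather() runs all calls concurrently; total latency is
    # roughly one call's worth, not N calls' worth.
    return await asyncio.gather(*(fetch_doc(n) for n in names))

results = asyncio.run(fan_out(["a.txt", "b.txt", "c.txt"]))
```

Sequentially, three 10ms calls take ~30ms; fanned out, they finish together in ~10ms, which is why batched tool calls matter for agent throughput.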
Strengths and Weak Spots
| Strengths | Weak Spots |
| --- | --- |
| End-to-end code work (refactors, test fixes, modernizations) | Latency vs. quality tradeoff when enabling extended thinking |
| Browser and computer use (procurement flows, spreadsheets) | Safety filters occasionally overzealous |
| Agent orchestration with better state tracking | Models may adapt to synthetic evals, requiring live testing |
| Finance and security analysis pipelines | Setup policies required for reliability |
Coding: How It Feels Different
If you used Sonnet 4 or 3.5, you’ll notice 4.5’s tempo shift. It plans migrations, writes tests upfront, and iterates against failing cases before handing you a patch.
Not a perfect one-shot diff, but coherent threads across multiple edits and tool calls. That’s the game-changer.
Pro tip: For larger edits, turn on extended thinking, have it write or update tests first, then run the suite as a gating step.
Computer Use: Less Cursor Lost, More Jobs Done
| Task Type | Sonnet 4 Behavior | Sonnet 4.5 Behavior |
| --- | --- | --- |
| OSWorld score | 42.2% | 61.4% |
| UI handling | Cursor often lost in forms | Improved state tracking |
| Admin tasks | Frequent failure on repetitive steps | More reliable completion |
| Multi-tasking | Sequential, slow | Parallel pipelining across tools |
For product teams building agentic UI features, the gains are real. But keep permission prompts and kill-switches in front of users. Ease of action cuts both ways.
Deployment and Pricing Paths
| Option | Where to Access | Best Fit |
| --- | --- | --- |
| Claude app | Web, iOS, Android | Individuals and small teams |
| Claude API | `claude-sonnet-4-5-20250929` | Product integration, startups |
| AWS Bedrock | `anthropic.claude-sonnet-4-5-20250929-v1:0` | Enterprises with AWS governance |
| Google Vertex AI | `claude-sonnet-4-5@20250929` | Enterprises with GCP pipelines |
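If one codebase targets more than one of these surfaces, keeping the identifiers in a single lookup avoids copy-paste drift. The IDs below are the ones listed above; the dictionary and function are my own convention:

```python
# Sonnet 4.5 model identifiers per deployment surface.
SONNET_45_IDS = {
    "anthropic": "claude-sonnet-4-5-20250929",
    "bedrock": "anthropic.claude-sonnet-4-5-20250929-v1:0",
    "vertex": "claude-sonnet-4-5@20250929",
}

def model_id(platform: str) -> str:
    """Look up the Sonnet 4.5 identifier for a platform (KeyError if unknown)."""
    return SONNET_45_IDS[platform]
```

Note that each provider uses a different naming scheme for the same snapshot date, so hardcoding one ID and string-munging the others is a recipe for subtle 404s.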
Where Sonnet 4.5 Fits in the Model Landscape
It’s tempting to put Sonnet 4.5 in a head-to-head with GPT or Gemini. The reality is simpler. If your backlog is full of code refactors and spreadsheet-driven yak-shaves, 4.5 is the best default right now.
Creative writing flair, bespoke reasoning styles, or unique ecosystems may still steer you elsewhere. But for tool-heavy, multi-step work, Sonnet 4.5 is the new standard.
Final Word: Who Should Switch Now
| Team Type | Why Switch Now |
| --- | --- |
| Engineering | Backlogs full of refactors, test repairs, modernizations |
| Ops/RevOps/Product | Browser and spreadsheet-heavy workflows |
| Finance/Security | Routine patch cycles, portfolio analysis, triage pipelines |
Sonnet 4.5 doesn’t “do everything.” But it makes a wider band of real work automatable, reliable, and auditable.
If your mental model of LLMs was “great for drafts, bad for doing,” it’s time to update it. With the right scaffolding, Sonnet 4.5 feels less like a chatty assistant and more like a steady, tool-savvy teammate that can actually move tickets across the board.
About the Author

Robert Moseley IV is the Founder and CEO of GTM Engine, a pipeline execution platform that’s changing the way modern revenue teams work. With a background in sales leadership, product strategy, and data architecture, he’s spent more than 10 years helping fast-growing companies move away from manual processes and adopt smarter, scalable systems. At GTM Engine, Robert is building what he calls the go-to-market nervous system: it tracks every interaction, uses AI to enrich CRM data, and gives teams the real-time visibility they need to stay on track. His true north is simple: take the guesswork out of sales and help revenue teams make decisions based on facts, not gut feel.