Episode
The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
- Published
- Nov 28, 2024
- Duration seconds
- 4270
- Processing state
- processed
- Canonical source
- https://www.latent.space/p/claude-sonnet
Actions
POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-new-claude-3-5-sonnet-computer-use-and-building-sota-agents-with-erik-schluntz-anthropic/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/latent-space-ai-engineer/the-new-claude-3-5-sonnet-computer-use-and-building-sota-agents-with-erik-schluntz-anthropic.md
Read the agent-friendly Markdown representation of this episode resource.
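As a minimal sketch, the two actions listed above could be prepared from Python's standard library as follows. The URLs are copied from this page; the helper names (`transcription_request`, `markdown_request`) are hypothetical, and whether the POST expects a body, headers, or authentication is not stated here, so the sketch only builds the requests and leaves sending them to the caller.

```python
from urllib.request import Request

# Values copied from the action URLs listed on this page.
BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = (
    "the-new-claude-3-5-sonnet-computer-use-and-building-sota-agents"
    "-with-erik-schluntz-anthropic"
)


def transcription_request() -> Request:
    """Build the POST that asks for low-priority transcript generation.

    Hypothetical helper: no body or auth header is set, because the page
    does not say whether either is required.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation of this episode."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


if __name__ == "__main__":
    # Print what would be sent; pass a request to urllib.request.urlopen()
    # to actually send it.
    for req in (transcription_request(), markdown_request()):
        print(req.get_method(), req.full_url)
```

Because the page describes the POST as idempotent, repeating it for the same episode should not queue duplicate work, though the response format is not documented here.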
Summary
We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe for a chance to appear on the show!

The vibe shift we observed in July in favor of Claude 3.5 Sonnet, first introduced in June, has been remarkably long-lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions. Anthropic’s Claude ends 2024 as the preferred model for AI Engineers and is even the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month.

Anthropic has now raised an additional $4b from Amazon and shipped an incredibly well-received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors.

Solving SWE-Bench

As part of the October Sonnet release, Anthropic teased a blink-and-you’ll-miss-it result:

The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models, including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.

T…