{"podcast":{"title":"Latent Space: The AI Engineer Podcast","slug":"latent-space-ai-engineer","podcast_index_feed_id":6058902,"rss_url":"https://api.substack.com/feed/podcast/1084089.rss","website_url":"https://www.latent.space/podcast","image_url":"https://substackcdn.com/feed/podcast/1084089/ca7468da5614a246d2906ee8926f6de7.jpg","author":"Latent.Space","episode_count":214,"summary":"The AI Engineer newsletter + Top technical AI podcast. How leading labs build Agents, Models, Infra, & AI for Science. See https://latent.space/about for highlights from Greg Brockman, Andrej Karpathy, George Hotz, Simon Willison, Soumith Chintala et al!","last_synced_at":"2026-07-17T00:20:53.505905+00:00","page_url":"https://stenobird.com/podcast/latent-space-ai-engineer"},"episode":{"title":"⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data","slug":"the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data","published_at":"2026-02-23T20:03:11+00:00","page_url":"https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data","show_page_url":"https://stenobird.com/podcast/latent-space-ai-engineer","url":"https://www.latent.space/p/swe-bench-dead","audio_url":"https://api.substack.com/feed/podcast/188928663/d1b8836e5d38b238ccf001345a411fc7.mp3","summary":"OpenAI researchers explain why the industry-standard SWE-bench Verified has reached saturation and is no longer a reliable metric for coding progress. 
They detail the shift toward SWE-bench Pro to combat data contamination and move toward evaluating long-horizon, complex engineering tasks.","meta_description":"OpenAI's Frontier Evals team discusses the end of SWE-bench Verified, the problem of benchmark contamination, and the future of evaluating AI software eng…","key_points":["Main idea: SWE-bench Verified has become unreliable due to benchmark saturation and significant data contamination in frontier models","Failure mode: Models are demonstrating 'familiarity' with tasks by regurgitating ground-truth solutions and repository-specific details rather than solving new problems","Practical takeaway: The field is transitioning to SWE-bench Pro, which features harder, more diverse, and longer-duration tasks (1–4+ hours) to provide more headroom for progress","Main idea: Future coding evaluations must move beyond simple pass/fail tests to measure design taste, code maintainability, and architectural decision-making","Future direction: The next frontier for AI agents lies in automating complex research workflows and end-to-end product development rather than just fixing isolated bugs"],"chapters":[{"start_ms":60000,"title":"Why SWE-bench Stalled","summary":"The thesis behind OpenAI's decision to stop reporting SWE-bench Verified due to saturation and contamination."},{"start_ms":190000,"title":"The Effort Behind Verified","summary":"The massive human-in-the-loop effort required to curate high-quality, expert-reviewed coding tasks."},{"start_ms":310000,"title":"Identifying Contamination","summary":"How 'contamination auditor agents' catch models revealing familiarity with specific repositories."},{"start_ms":420000,"title":"Unfair Tests and Narrow Specs","summary":"Analyzing how poorly specified problems lead to model failures that don't reflect true reasoning limitations."},{"start_ms":530000,"title":"The Benchmark Evolution Cycle","summary":"Discussing the natural lifecycle of benchmarks as they 
move from novel metrics to saturated, 'vibe-based' increments."},{"start_ms":645000,"title":"Transitioning to SWE-bench Pro","summary":"The technical advantages of the new Pro benchmark, including increased task complexity and diversity."},{"start_ms":760000,"title":"Defining Ideal Coding Evals","summary":"What true engineering capabilities—like design decisions and efficiency—should look like in a benchmark."},{"start_ms":875000,"title":"Beyond Pass/Fail Tests","summary":"Moving toward evaluating qualitative traits like 'design taste' and long-term code maintainability."}],"topics":["AI Engineering","Software Benchmarks","OpenAI","SWE-bench","LLM Evaluation","Data Contamination","Automated Software Engineering","Frontier Models"],"duration_seconds":1572,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}