# ⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

- Page: https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data
- Text version: https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data.md
- Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
- Published: 2026-02-23T20:03:11+00:00
- Episode link: https://www.latent.space/p/swe-bench-dead
- Audio file: https://api.substack.com/feed/podcast/188928663/d1b8836e5d38b238ccf001345a411fc7.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data
- Duration seconds: 1572

## Resource

OpenAI researchers explain why the industry-standard SWE-bench Verified has reached saturation and is no longer a reliable metric for coding progress. They detail the shift toward SWE-bench Pro to combat data contamination and move toward evaluating long-horizon, complex engineering tasks.

## Highlights

- Main idea: SWE-bench Verified has become unreliable due to benchmark saturation and significant data contamination in frontier models.
- Failure mode: Models are demonstrating 'familiarity' with tasks by regurgitating ground truth solutions and repository-specific details rather than solving new problems.
- Practical takeaway: The field is transitioning to SWE-bench Pro, which features harder, more diverse, and longer-duration tasks (1–4+ hours) to provide more headroom for progress.
- Main idea: Future coding evaluations must move beyond simple pass/fail tests to measure design taste, code maintainability, and architectural decision-making.
- Future direction: The next frontier for AI agents lies in automating complex research workflows and end-to-end product development rather than just fixing isolated bugs.

## Topics

AI Engineering, Software Benchmarks, OpenAI, SWE-bench, LLM Evaluation, Data Contamination, Automated Software Engineering, Frontier Models

## Chapters

- 1:00 — Why SWE-bench Stalled: The thesis behind OpenAI's decision to stop reporting SWE-bench Verified due to saturation and contamination.
- 3:10 — The Effort Behind Verified: The massive human-in-the-loop effort required to curate high-quality, expert-reviewed coding tasks.
- 5:10 — Identifying Contamination: How models are being caught using 'contamination auditor agents' to reveal familiarity with specific repositories.
- 7:00 — Unfair Tests and Narrow Specs: Analyzing how poorly specified problems lead to model failures that don't reflect true reasoning limitations.
- 8:50 — The Benchmark Evolution Cycle: Discussing the natural lifecycle of benchmarks as they move from novel metrics to saturated, 'vibe-based' increments.
- 10:45 — Transitioning to SWE-bench Pro: The technical advantages of the new Pro benchmark, including increased task complexity and diversity.
- 12:40 — Defining Ideal Coding Evals: What true engineering capabilities, like design decisions and efficiency, should look like in a benchmark.
- 14:35 — Beyond Pass/Fail Tests: Moving toward evaluating qualitative traits like 'design taste' and long-term code maintainability.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
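
For agents that want to script the two entries under Actions, the following is a minimal sketch of how the requests might be constructed in Python. The URLs and HTTP methods come from this page. Everything else is an assumption: the helper names are hypothetical, the POST is assumed to need no body or authentication header, and the response format is not documented here. The functions only build request objects. Sending them (for example with `urllib.request.urlopen`) is left to the caller.

```python
from urllib.request import Request

# Values taken from the URLs listed on this page.
BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = (
    "the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-"
    "openai-frontier-evals-human-data"
)


def build_read_markdown(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build the GET request for the read_markdown action (hypothetical helper)."""
    return Request(f"{BASE}/podcast/{podcast}/{episode}.md", method="GET")


def build_request_transcript(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build the POST request for the request_transcript action (hypothetical helper).

    Assumption: no request body and no auth header are required. The page
    describes the action as idempotent, so repeating it should be safe.
    """
    url = f"{BASE}/v1/public/podcasts/{podcast}/episodes/{episode}/transcription-requests"
    return Request(url, method="POST")


if __name__ == "__main__":
    for req in (build_read_markdown(), build_request_transcript()):
        print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending reflects the note under Actions: reading the page never enqueues transcription, so the POST should only be sent when a transcript is actually needed.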