# ⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Page: https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data
Text version: https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data.md
Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
Published: 2026-02-23T20:03:11+00:00
Episode link: https://www.latent.space/p/swe-bench-dead
Audio file: https://api.substack.com/feed/podcast/188928663/d1b8836e5d38b238ccf001345a411fc7.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data
Duration seconds: 1572

## Resource

OpenAI researchers explain why the industry-standard SWE-bench Verified has reached saturation and is no longer a reliable metric for coding progress. They detail the shift to SWE-bench Pro, which is intended to reduce data contamination and to evaluate long-horizon, complex engineering tasks.

## Highlights
- Main idea: SWE-bench Verified has become unreliable due to benchmark saturation and significant data contamination in frontier models
- Failure mode: Models are demonstrating 'familiarity' with tasks by regurgitating ground-truth solutions and repository-specific details rather than solving new problems
- Practical takeaway: The field is transitioning to SWE-bench Pro, which features harder, more diverse, and longer-duration tasks (1–4+ hours) to provide more headroom for progress
- Main idea: Future coding evaluations must move beyond simple pass/fail tests to measure design taste, code maintainability, and architectural decision-making
- Future direction: The next frontier for AI agents lies in automating complex research workflows and end-to-end product development rather than just fixing isolated bugs

## Topics

AI Engineering, Software Benchmarks, OpenAI, SWE-bench, LLM Evaluation, Data Contamination, Automated Software Engineering, Frontier Models

## Chapters
- 1:00 — Why SWE-bench Stalled: The thesis behind OpenAI's decision to stop reporting SWE-bench Verified due to saturation and contamination.
- 3:10 — The Effort Behind Verified: The massive human-in-the-loop effort required to curate high-quality, expert-reviewed coding tasks.
- 5:10 — Identifying Contamination: How 'contamination auditor agents' are used to catch models revealing familiarity with specific repositories.
- 7:00 — Unfair Tests and Narrow Specs: Analyzing how poorly specified problems lead to model failures that don't reflect true reasoning limitations.
- 8:50 — The Benchmark Evolution Cycle: Discussing the natural lifecycle of benchmarks as they move from novel metrics to saturated, 'vibe-based' increments.
- 10:45 — Transitioning to SWE-bench Pro: The technical advantages of the new Pro benchmark, including increased task complexity and diversity.
- 12:40 — Defining Ideal Coding Evals: What true engineering capabilities—like design decisions and efficiency—should look like in a benchmark.
- 14:35 — Beyond Pass/Fail Tests: Moving toward evaluating qualitative traits like 'design taste' and long-term code maintainability.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
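
As a rough illustration of the two actions above, the following Python sketch builds the corresponding HTTP requests with the standard library. The method and URL of each request come from the Actions list; everything else is an assumption: the helper names (`build_transcript_request`, `build_markdown_request`, `send`) are hypothetical, and this page does not say whether the POST expects a body, headers, or authentication, or what the response looks like.

```python
import urllib.request

# URL components taken from the Actions list on this page.
BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = (
    "the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-"
    "openai-frontier-evals-human-data"
)


def build_transcript_request():
    """Build the request_transcript call (POST, documented as idempotent).

    Assumption: no request body or auth header is needed; the page does
    not specify either.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return urllib.request.Request(url, method="POST")


def build_markdown_request():
    """Build the read_markdown call (GET of the .md representation)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(request, timeout=10.0):
    """Send a prepared request and return (status_code, body_text).

    Hypothetical helper: the response format is not documented here,
    so the body is returned as undecoded-structure text.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

Under these assumptions, an agent would call `send(build_transcript_request())` only when it needs the episode processed, and `send(build_markdown_request())` to read the page. Because the POST is described as idempotent, repeating it after a timeout should be safe, although the exact behavior is not stated here.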

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.