Episode
⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
- Published
- Feb 23, 2026
- Duration seconds
- 1572
- Processing state
- processed
- Canonical source
- https://www.latent.space/p/swe-bench-dead
Actions
POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/latent-space-ai-engineer/the-end-of-swe-bench-verified-mia-glaese-olivia-watkins-openai-frontier-evals-human-data.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
OpenAI researchers explain why the industry-standard SWE-bench Verified has reached saturation and is no longer a reliable metric for coding progress. They detail the shift toward SWE-bench Pro to combat data contamination and move toward evaluating long-horizon, complex engineering tasks.
Topics
- AI Engineering
- Software Benchmarks
- OpenAI
- SWE-bench
- LLM Evaluation
- Data Contamination
- Automated Software Engineering
- Frontier Models
Highlights
- Main idea: SWE-bench Verified has become unreliable due to benchmark saturation and significant data contamination in frontier models
- Failure mode: Models are demonstrating 'familiarity' with tasks by regurgitating ground truth solutions and repository-specific details rather than solving new problems
- Practical takeaway: The field is transitioning to SWE-bench Pro, which features harder, more diverse, and longer-duration tasks (1–4+ hours) to provide more headroom for progress
- Main idea: Future coding evaluations must move beyond simple pass/fail tests to measure design taste, code maintainability, and architectural decision-making
- Future direction: The next frontier for AI agents lies in automating complex research workflows and end-to-end product development rather than just fixing isolated bugs
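The "regurgitating ground truth solutions" failure mode can be illustrated with a small overlap check. This is a minimal sketch only: the episode describes contamination auditor agents, not this method, and the n-gram heuristic, the function names `ngrams` and `overlap_ratio`, and the whitespace tokenization are all assumptions made here for illustration.

```python
# Hypothetical illustration: a simple n-gram overlap heuristic, NOT the
# auditor-agent approach discussed in the episode.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(model_output: str, reference_patch: str, n: int = 5) -> float:
    """Fraction of the reference patch's n-grams that reappear verbatim in
    the model output. Returns 0.0 when the reference is shorter than n tokens."""
    reference = ngrams(reference_patch, n)
    if not reference:
        return 0.0
    produced = ngrams(model_output, n)
    return len(reference & produced) / len(reference)
```

A ratio near 1.0 on a long patch would suggest verbatim reproduction and is at most a signal for closer review, since a short or obvious fix can match the reference without any contamination.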
Chapters
- 1:00 Why SWE-bench Stalled: The thesis behind OpenAI's decision to stop reporting SWE-bench Verified due to saturation and contamination.
- 3:10 The Effort Behind Verified: The massive human-in-the-loop effort required to curate high-quality, expert-reviewed coding tasks.
- 5:10 Identifying Contamination: How models are being caught using 'contamination auditor agents' to reveal familiarity with specific repositories.
- 7:00 Unfair Tests and Narrow Specs: Analyzing how poorly specified problems lead to model failures that don't reflect true reasoning limitations.
- 8:50 The Benchmark Evolution Cycle: Discussing the natural lifecycle of benchmarks as they move from novel metrics to saturated, 'vibe-based' increments.
- 10:45 Transitioning to SWE-bench Pro: The technical advantages of the new Pro benchmark, including increased task complexity and diversity.
- 12:40 Defining Ideal Coding Evals: What true engineering capabilities, like design decisions and efficiency, should look like in a benchmark.
- 14:35 Beyond Pass/Fail Tests: Moving toward evaluating qualitative traits like 'design taste' and long-term code maintainability.