# AI incidents, audits, and the limits of benchmarks

Page: https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks
Text version: https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md
Podcast: [Practical AI](https://stenobird.com/podcast/practical-ai)
Published: 2026-02-13T15:57:56+00:00
Episode link: https://share.transistor.fm/s/1b8e65f4
Audio file: https://pscrb.fm/rss/p/dts.podtrac.com/redirect.mp3/media.transistor.fm/1b8e65f4/5f10a20b.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks
Duration seconds: 2572

## Resource

Standard AI benchmarks often fail to predict real-world performance because they are designed for research rather than practical safety verification. This discussion explores the necessity of third-party auditing and the growing need for a systematic way to track and prevent AI incidents.

## Highlights
- Main idea: Benchmarks are often built for knowledge generation and research rather than practical, real-world reliability testing
- Failure mode: Relying on 'trust me' benchmarks without verifying the underlying evidence can lead to deploying brittle, unsafe systems
- Practical takeaway: Third-party auditing is essential for organizations to prove the safety and reliability of their models to stakeholders
- Main idea: There is a critical distinction between security (preventing attacks) and safety (ensuring intended behavior) in AI systems
- Failure mode: Indexing every minor, repeated harm is less useful than identifying the systemic failures that affect the broader AI ecosystem

## Topics

AI Safety, AI Auditing, Machine Learning Evaluation, AI Incident Database, Model Benchmarking, AI Security, AI Verification, Red-Teaming

## Chapters
- 1:00 — Introduction to AI Verification: An introduction to Sean McGregor's work in evaluating machine learning systems and the transition from research to real-world impact.
- 4:05 — The Shift to Evaluation: Discussing the explosion of large language models and the necessity of dedicated testing and evaluation frameworks.
- 7:10 — The Fluidity of Safety Terms: Exploring how terms like safety and security overlap and the difficulty of defining boundaries in evolving AI systems.
- 10:25 — Tracking AI Incidents: The challenges of indexing small, frequent harms versus focusing on significant, systemic AI failures.
- 13:50 — The Role of Third-Party Auditing: Why organizations need external audits to verify that models perform as intended and to build institutional trust.
- 16:50 — Verifying the Balance Sheet: Using a financial metaphor to explain how audits must verify the 'receipts' behind benchmark performance claims.
- 23:05 — The Benchmark Gap: Analyzing why current benchmarks are often unsuitable for practical deployment and real-world safety assessment.
- 26:30 — Security vs. Safety: Distinguishing between the different goals of security professionals and safety researchers in the AI community.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
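
The two actions above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: the URLs and HTTP methods come from the action list, but the helper names (`build_request_transcript`, `build_read_markdown`, `send`) are hypothetical, and it assumes the POST needs no request body and no authentication, which this page does not confirm. The response format is also not documented here, so `send` returns the raw status code and body text.

```python
import urllib.request

# URLs copied from the Actions list on this page.
EPISODE_API = (
    "https://stenobird.com/v1/public/podcasts/practical-ai/episodes/"
    "ai-incidents-audits-and-the-limits-of-benchmarks"
)
EPISODE_MARKDOWN = (
    "https://stenobird.com/podcast/practical-ai/"
    "ai-incidents-audits-and-the-limits-of-benchmarks.md"
)


def build_request_transcript() -> urllib.request.Request:
    """Build the request_transcript call (POST).

    Assumption: no body or auth header is required; the page does not say.
    """
    return urllib.request.Request(
        f"{EPISODE_API}/transcription-requests", method="POST"
    )


def build_read_markdown() -> urllib.request.Request:
    """Build the read_markdown call (GET) for the Markdown representation."""
    return urllib.request.Request(EPISODE_MARKDOWN, method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, body text).

    Raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

An agent that needs the episode processed would call `send(build_request_transcript())` explicitly, since fetching the page with `send(build_read_markdown())` does not enqueue anything. Because the action is described as idempotent, repeating the POST should not create duplicate work, though what the endpoint returns on a repeat call is not stated here.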

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.