# AI incidents, audits, and the limits of benchmarks

Page: https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks
Text version: https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md
Podcast: [Practical AI](https://stenobird.com/podcast/practical-ai)
Published: 2026-02-13T15:57:56+00:00
Episode link: https://share.transistor.fm/s/1b8e65f4
Audio file: https://pscrb.fm/rss/p/dts.podtrac.com/redirect.mp3/media.transistor.fm/1b8e65f4/5f10a20b.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks
Duration seconds: 2572

## Resource

Standard AI benchmarks often fail to predict real-world performance because they are designed for research rather than practical safety verification. This discussion explores the necessity of third-party auditing and the growing need for a systematic way to track and prevent AI incidents.

## Highlights

- Main idea: Benchmarks are often built for knowledge generation and research rather than practical, real-world reliability testing
- Failure mode: Relying on 'trust me' benchmarks without verifying the underlying evidence can lead to deploying brittle, unsafe systems
- Practical takeaway: Third-party auditing is essential for organizations to prove the safety and reliability of their models to stakeholders
- Main idea: There is a critical distinction between security (preventing attacks) and safety (ensuring intended behavior) in AI systems
- Failure mode: Tracking minor, repeated harms is less useful than identifying systemic failures that impact the broader AI ecosystem

## Topics

AI Safety, AI Auditing, Machine Learning Evaluation, AI Incident Database, Model Benchmarking, AI Security, AI Verification, Red-Teaming

## Chapters

- 1:00 — Introduction to AI Verification: An introduction to Sean McGregor's work in evaluating machine learning systems and the transition from research to real-world impact.
- 4:05 — The Shift to Evaluation: Discussing the explosion of large language models and the necessity of dedicated testing and evaluation frameworks.
- 7:10 — The Fluidity of Safety Terms: Exploring how terms like safety and security overlap and the difficulty of defining boundaries in evolving AI systems.
- 10:25 — Tracking AI Incidents: The challenges of indexing small, frequent harms versus focusing on significant, systemic AI failures.
- 13:50 — The Role of Third-Party Auditing: Why organizations need external audits to verify that models perform as intended and to build institutional trust.
- 16:50 — Verifying the Balance Sheet: Using a financial metaphor to explain how audits must verify the 'receipts' behind benchmark performance claims.
- 23:05 — The Benchmark Gap: Analyzing why current benchmarks are often unsuitable for practical deployment and real-world safety assessment.
- 26:30 — Security vs. Safety: Distinguishing between the different goals of security professionals and safety researchers in the AI community.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
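
For agents that want to call the two endpoints listed under Actions, the following is a minimal sketch using only the Python standard library. The URLs and HTTP methods come from this page; everything else is an assumption. In particular, the page does not document a request body, headers, authentication, or response format, so the sketch sends an empty POST body, and the helper names (`build_read_markdown`, `build_request_transcript`, `send`) are hypothetical.

```python
from urllib.request import Request, urlopen

BASE = "https://stenobird.com"
PODCAST = "practical-ai"
EPISODE = "ai-incidents-audits-and-the-limits-of-benchmarks"


def build_read_markdown() -> Request:
    """Prepare the read_markdown action (GET the episode's Markdown page)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


def build_request_transcript() -> Request:
    """Prepare the request_transcript action (POST a transcription request).

    Assumption: the endpoint accepts an empty body. The page documents
    no payload, headers, or authentication for this call.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, data=b"", method="POST")


def send(req: Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Print the prepared calls without touching the network.
    for req in (build_read_markdown(), build_request_transcript()):
        print(req.get_method(), req.full_url)
```

Building the requests separately from sending them keeps the sketch inspectable offline. Because the page describes `request_transcript` as idempotent, calling `send(build_request_transcript())` more than once should be safe, though the exact response to a repeated request is not specified here.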