Episode

AI incidents, audits, and the limits of benchmarks

Podcast
Practical AI
Published
Feb 13, 2026
Duration seconds
2572
Processing state
processed
Canonical source
https://share.transistor.fm/s/1b8e65f4
Audio
https://pscrb.fm/rss/p/dts.podtrac.com/redirect.mp3/media.transistor.fm/1b8e65f4/5f10a20b.mp3
JSON
/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks
Markdown
/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md
    Read the agent-friendly Markdown representation of this episode resource.
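
The two actions above can be sketched as plain HTTP requests. This is a minimal sketch using only the Python standard library, not a documented client. The URLs are the ones listed above. The helper names, the empty POST body, and the absence of authentication headers are assumptions, since this page does not describe a request body or credentials.

```python
import urllib.request

# URLs copied from the Actions list above.
TRANSCRIPTION_URL = (
    "https://stenobird.com/v1/public/podcasts/practical-ai/episodes/"
    "ai-incidents-audits-and-the-limits-of-benchmarks/transcription-requests"
)
MARKDOWN_URL = (
    "https://stenobird.com/podcast/practical-ai/"
    "ai-incidents-audits-and-the-limits-of-benchmarks.md"
)


def build_transcription_request() -> urllib.request.Request:
    """Build the POST that asks for low-priority transcript generation.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    return urllib.request.Request(TRANSCRIPTION_URL, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Build the GET for the Markdown representation of this episode."""
    return urllib.request.Request(MARKDOWN_URL, method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a request and return (status code, decoded body).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Because the POST is described as idempotent, calling send(build_transcription_request()) more than once should be safe. The shape of the response is not specified here, so the sketch returns the raw body.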

Summary

Standard AI benchmarks often fail to predict real-world performance because they are designed for research rather than practical safety verification. This discussion explores the necessity of third-party auditing and the growing need for a systematic way to track and prevent AI incidents.

Topics

  • AI Safety
  • AI Auditing
  • Machine Learning Evaluation
  • AI Incident Database
  • Model Benchmarking
  • AI Security
  • AI Verification
  • Red-Teaming

Highlights

  • Main idea: Benchmarks are often built for knowledge generation and research rather than practical, real-world reliability testing
  • Failure mode: Relying on 'trust me' benchmarks without verifying the underlying evidence can lead to deploying brittle, unsafe systems
  • Practical takeaway: Third-party auditing is essential for organizations to prove the safety and reliability of their models to stakeholders
  • Main idea: There is a critical distinction between security (preventing attacks) and safety (ensuring intended behavior) in AI systems
  • Failure mode: Indexing every minor, repeated harm can crowd out attention to the systemic failures that affect the broader AI ecosystem

Chapters

  1. 1:00 Introduction to AI Verification: An introduction to Sean McGregor's work on evaluating machine learning systems and the transition from research to real-world impact.
  2. 4:05 The Shift to Evaluation: Discussing the explosion of large language models and the necessity of dedicated testing and evaluation frameworks.
  3. 7:10 The Fluidity of Safety Terms: Exploring how terms like safety and security overlap and the difficulty of defining boundaries in evolving AI systems.
  4. 10:25 Tracking AI Incidents: The challenges of indexing small, frequent harms versus focusing on significant, systemic AI failures.
  5. 13:50 The Role of Third-Party Auditing: Why organizations need external audits to verify that models perform as intended and to build institutional trust.
  6. 16:50 Verifying the Balance Sheet: Using a financial metaphor to explain how audits must verify the 'receipts' behind benchmark performance claims.
  7. 23:05 The Benchmark Gap: Analyzing why current benchmarks are often unsuitable for practical deployment and real-world safety assessment.
  8. 26:30 Security vs. Safety: Distinguishing between the different goals of security professionals and safety researchers in the AI community.