{"podcast":{"title":"Practical AI","slug":"practical-ai","podcast_index_feed_id":444526,"rss_url":"https://feeds.transistor.fm/practical-ai-machine-learning-data-science-llm","website_url":"https://practicalai.fm","image_url":"https://img.transistorcdn.com/WMlp2ug34XB6LDJ3-vnzti_-_y144LUlFW0Xzzn3fss/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS8wMTZi/ZWJmNWIwNDdmYTcw/NGJjMTExZjNjZmYy/M2ZjNS5wbmc.jpg","author":"Practical AI LLC","episode_count":357,"summary":"Making artificial intelligence practical, productive & accessible to everyone. Practical AI is a show in which technology professionals, business people, students, enthusiasts, and expert guests engage in lively discussions about Artificial Intelligence and related topics (Machine Learning, Deep Learning, Neural Networks, GANs, MLOps, AIOps, LLMs & more). The focus is on productive implementations and real-world scenarios that are accessible to everyone. If you want to keep up with the latest advances in AI, while keeping one foot in the real world, then this is the show for you!","last_synced_at":null,"page_url":"https://stenobird.com/podcast/practical-ai"},"episode":{"title":"AI incidents, audits, and the limits of benchmarks","slug":"ai-incidents-audits-and-the-limits-of-benchmarks","published_at":"2026-02-13T15:57:56+00:00","page_url":"https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks","show_page_url":"https://stenobird.com/podcast/practical-ai","url":"https://share.transistor.fm/s/1b8e65f4","audio_url":"https://pscrb.fm/rss/p/dts.podtrac.com/redirect.mp3/media.transistor.fm/1b8e65f4/5f10a20b.mp3","summary":"Standard AI benchmarks often fail to predict real-world performance because they are designed for research rather than practical safety verification. This discussion explores the necessity of third-party auditing and the growing need for a systematic way to track and prevent AI incidents.","meta_description":"Explore the gap between AI benchmarks and real-world safety. Learn why third-party auditing and incident tracking are critical for reliable AI deployment.","key_points":["Main idea: Benchmarks are often built for knowledge generation and research rather than practical, real-world reliability testing","Failure mode: Relying on 'trust me' benchmarks without verifying the underlying evidence can lead to deploying brittle, unsafe systems","Practical takeaway: Third-party auditing is essential for organizations to prove the safety and reliability of their models to stakeholders","Main idea: There is a critical distinction between security (preventing attacks) and safety (ensuring intended behavior) in AI systems","Failure mode: Tracking minor, repeated harms is less useful than identifying systemic failures that impact the broader AI ecosystem"],"chapters":[{"start_ms":60000,"title":"Introduction to AI Verification","summary":"An introduction to Sean McGregor's work in evaluating machine learning systems and the transition from research to real-world impact."},{"start_ms":245000,"title":"The Shift to Evaluation","summary":"Discussing the explosion of large language models and the necessity of dedicated testing and evaluation frameworks."},{"start_ms":430000,"title":"The Fluidity of Safety Terms","summary":"Exploring how terms like safety and security overlap and the difficulty of defining boundaries in evolving AI systems."},{"start_ms":625000,"title":"Tracking AI Incidents","summary":"The challenges of indexing small, frequent harms versus focusing on significant, systemic AI failures."},{"start_ms":830000,"title":"The Role of Third-Party Auditing","summary":"Why organizations need external audits to verify that models perform as intended and to build institutional trust."},{"start_ms":1010000,"title":"Verifying the Balance Sheet","summary":"Using a financial metaphor to explain how audits must verify the 'receipts' behind benchmark performance claims."},{"start_ms":1385000,"title":"The Benchmark Gap","summary":"Analyzing why current benchmarks are often unsuitable for practical deployment and real-world safety assessment."},{"start_ms":1590000,"title":"Security vs. Safety","summary":"Distinguishing between the different goals of security professionals and safety researchers in the AI community."}],"topics":["AI Safety","AI Auditing","Machine Learning Evaluation","AI Incident Database","Model Benchmarking","AI Security","AI Verification","Red-Teaming"],"duration_seconds":2572,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/practical-ai/episodes/ai-incidents-audits-and-the-limits-of-benchmarks/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/practical-ai/ai-incidents-audits-and-the-limits-of-benchmarks.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}