{"podcast":{"title":"The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)","slug":"twiml-ai-podcast","podcast_index_feed_id":1045879,"rss_url":"https://feeds.megaphone.fm/MLN2155636147","website_url":"https://twimlai.com","image_url":"https://megaphone.imgix.net/podcasts/35230150-ee98-11eb-ad1a-b38cbabcd053/image/TWIML_AI_Podcast_Official_Cover_Art_1400px.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress","author":"TWIML","episode_count":785,"summary":"Machine learning and artificial intelligence are dramatically changing the way businesses operate and people live. The TWIML AI Podcast brings the top minds and ideas from the world of ML and AI to a broad and influential community of ML/AI researchers, data scientists, engineers and tech-savvy business and IT leaders. Hosted by Sam Charrington, a sought after industry analyst, speaker, commentator and thought leader. Technologies covered include machine learning, artificial intelligence, deep learning, natural language processing, neural networks, analytics, computer science, data science and more.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/twiml-ai-podcast"},"episode":{"title":"How to Find the Agent Failures Your Evals Miss with Scott Clark - #767","slug":"how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767","published_at":"2026-05-07T22:46:00+00:00","page_url":"https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767","show_page_url":"https://stenobird.com/podcast/twiml-ai-podcast","url":"https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss","audio_url":"https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7498240745.mp3?updated=1778194521","summary":"Standard evaluations often fail to catch subtle, emergent failures in production LLM agents. This episode explores how to use post-production analytics and vector fingerprints to discover 'unknown unknowns' and close the loop between production behavior and model refinement.","meta_description":"Learn how to use adaptive analytics and vector clustering to find the LLM agent failures that traditional evaluations miss.","key_points":["Main idea: Implement a 'Maslow’s hierarchy of observability' moving from basic telemetry to advanced analytics for discovering unknown failure modes","Failure mode: 'Lazy' tool-use hallucinations and subtle sub-distributions of behavior often bypass standard pre-production benchmarks","Practical takeaway: Use vector fingerprints of agent traces to cluster behaviors and identify specific pockets of sub-optimal performance","Practical takeaway: Integrate production analytics into a data flywheel to automatically generate new evals, guardrails, and fine-tuning datasets","Technical approach: Leverage OpenTelemetry and GenAI semantic conventions to build a robust foundation for agent monitoring"],"chapters":[{"start_ms":60000,"title":"The Hierarchy of Observability","summary":"Introduction to the layers of observability: telemetry for logging, monitoring for known signals, and analytics for discovering unknown unknowns."},{"start_ms":320000,"title":"Beyond Model Optimization","summary":"Shifting focus from squeezing marginal gains in benchmarks to ensuring reliability and trustworthiness in production environments."},{"start_ms":540000,"title":"Identifying Agent Anti-patterns","summary":"A look at real-world production failures, such as subtle hallucinations that standard evaluation suites fail to detect."},{"start_ms":755000,"title":"Clustering Behavior via Vector Fingerprints","summary":"Using high-dimensional distributions and stratified sampling to identify and taxonomize specific sub-patterns in agent traces."},{"start_ms":1025000,"title":"Building the Data Flywheel","summary":"How production analytics can be used to recursively refine system prompts, create new evaluation dimensions, and drive continuous improvement."},{"start_ms":1265000,"title":"Adaptive Analytics and the Feedback Loop","summary":"The importance of an adaptive approach to analytics that learns which signals matter most to the specific use case."},{"start_ms":2000000,"title":"Practical Implementation and Tooling","summary":"Recommendations for instrumentation using OpenTelemetry and the GenAI semantic conventions to enable effective monitoring."}],"topics":["LLM Agents","Observability","Production AI","Machine Learning Analytics","Vector Embeddings","Model Evaluation","OpenTelemetry","Agentic Workflows"],"duration_seconds":3199,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}