# How to Find the Agent Failures Your Evals Miss with Scott Clark - #767

Page: https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767
Text version: https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md
Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
Published: 2026-05-07T22:46:00+00:00
Episode link: https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss
Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7498240745.mp3?updated=1778194521
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767
Duration seconds: 3199

## Resource

Standard evaluations often fail to catch subtle, emergent failures in production LLM agents. This episode explores how to use post-production analytics and vector fingerprints to discover 'unknown unknowns' and close the loop between production behavior and model refinement.
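
The summary does not say how the vector fingerprints are computed. As a rough illustration only, the sketch below assumes a fingerprint is a normalized count of step types in a trace and groups traces with a greedy cosine-similarity pass. The step names, `fingerprint`, `cluster`, and the 0.9 threshold are all hypothetical; a production system would more likely use learned embeddings and a proper clustering library.

```python
import math
from collections import Counter


def fingerprint(steps, vocab):
    """Turn a trace (list of step names) into a unit-length count vector.

    Assumption: step-type counts stand in for a real trace embedding.
    """
    counts = Counter(steps)
    vec = [float(counts.get(name, 0)) for name in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def cluster(fingerprints, threshold=0.9):
    """Greedy 'leader' clustering: join the first cluster whose leader is
    similar enough, otherwise start a new cluster."""
    leaders, clusters = [], []
    for idx, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if cosine(fp, leader) >= threshold:
                clusters[c].append(idx)
                break
        else:
            leaders.append(fp)
            clusters.append([idx])
    return clusters


# Hypothetical traces; the third answers without calling any tool.
traces = [
    ["llm", "tool:search", "llm"],
    ["llm", "tool:search", "llm"],
    ["llm", "llm"],
    ["llm", "tool:search", "tool:search", "llm"],
]
vocab = sorted({step for trace in traces for step in trace})
clusters = cluster([fingerprint(t, vocab) for t in traces])

# Small clusters are candidates for manual review as unusual behavior.
for members in sorted(clusters, key=len):
    print(len(members), members)
```

With this toy data the tool-free trace ends up alone in its own cluster, which is the kind of small pocket of behavior a reviewer would inspect first.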
## Highlights

- Main idea: Implement a 'Maslow’s hierarchy of observability' moving from basic telemetry to advanced analytics for discovering unknown failure modes
- Failure mode: 'Lazy' tool-use hallucinations and subtle sub-distributions of behavior often bypass standard pre-production benchmarks
- Practical takeaway: Use vector fingerprints of agent traces to cluster behaviors and identify specific pockets of sub-optimal performance
- Practical takeaway: Integrate production analytics into a data flywheel to automatically generate new evals, guardrails, and fine-tuning datasets
- Technical approach: Leverage OpenTelemetry and GenAI semantic conventions to build a robust foundation for agent monitoring

## Topics

LLM Agents, Observability, Production AI, Machine Learning Analytics, Vector Embeddings, Model Evaluation, OpenTelemetry, Agentic Workflows

## Chapters

- 1:00 — The Hierarchy of Observability: Introduction to the layers of observability: telemetry for logging, monitoring for known signals, and analytics for discovering unknown unknowns.
- 5:20 — Beyond Model Optimization: Shifting focus from squeezing marginal gains in benchmarks to ensuring reliability and trustworthiness in production environments.
- 9:00 — Identifying Agent Anti-patterns: A look at real-world production failures, such as subtle hallucinations that standard evaluation suites fail to detect.
- 12:35 — Clustering Behavior via Vector Fingerprints: Using high-dimensional distributions and stratified sampling to identify and taxonomize specific sub-patterns in agent traces.
- 17:05 — Building the Data Flywheel: How production analytics can be used to recursively refine system prompts, create new evaluation dimensions, and drive continuous improvement.
- 21:05 — Adaptive Analytics and the Feedback Loop: The importance of an adaptive approach to analytics that learns which signals matter most to the specific use case.
- 33:20 — Practical Implementation and Tooling: Recommendations for instrumentation using OpenTelemetry and the GenAI semantic conventions to enable effective monitoring.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
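
An agent that needs this episode processed could invoke the `request_transcript` action along the lines of the sketch below. It uses only the Python standard library and the endpoint listed under Actions. The empty request body, the absence of authentication, and the helper names `build_transcript_request` and `request_transcript` are assumptions, since the page does not document them; the meaning of the response status is likewise unspecified here.

```python
import urllib.request

# Base path taken from the Actions section of this page.
BASE = "https://stenobird.com/v1/public/podcasts"


def build_transcript_request(podcast_slug: str, episode_slug: str) -> urllib.request.Request:
    """Build (but do not send) the POST for the request_transcript action.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    url = f"{BASE}/{podcast_slug}/episodes/{episode_slug}/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def request_transcript(podcast_slug: str, episode_slug: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    The action is described as idempotent, so repeating the call should be safe.
    """
    req = build_transcript_request(podcast_slug, episode_slug)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    req = build_transcript_request(
        "twiml-ai-podcast",
        "how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767",
    )
    # Only prints the prepared request; call request_transcript(...) to send it.
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the URL logic easy to check without touching the network.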