Episode

How to Find the Agent Failures Your Evals Miss with Scott Clark - #767

Podcast: The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Published: May 7, 2026
Duration seconds: 3199
Processing state: processed
Canonical source: https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss
Audio: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7498240745.mp3?updated=1778194521
JSON: /v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767
Markdown: /podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md

Actions

POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

Standard evaluations often fail to catch subtle, emergent failures in production LLM agents. This episode explores how to use post-production analytics and vector fingerprints to discover 'unknown unknowns' and close the loop between production behavior and model refinement.

Topics

LLM Agents
Observability
Production AI
Machine Learning Analytics
Vector Embeddings
Model Evaluation
OpenTelemetry
Agentic Workflows

Highlights

Main idea: Implement a 'Maslow’s hierarchy of observability' that moves from basic telemetry, through monitoring of known signals, to analytics for discovering unknown failure modes
Failure mode: 'Lazy' tool-use hallucinations and subtle sub-distributions of behavior often bypass standard pre-production benchmarks
Practical takeaway: Use vector fingerprints of agent traces to cluster behaviors and identify specific pockets of sub-optimal performance
Practical takeaway: Integrate production analytics into a data flywheel to automatically generate new evals, guardrails, and fine-tuning datasets
Technical approach: Leverage OpenTelemetry and GenAI semantic conventions to build a robust foundation for agent monitoring
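
Illustration: the vector-fingerprint takeaway above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the method described in the episode: the fingerprint here is just a count of step names per trace, and the grouping is a greedy cosine-similarity pass with a hypothetical threshold of 0.9. The trace data, step names, and function names are invented for illustration; a production system would more likely use learned embeddings and a proper clustering library.

```python
from collections import Counter
from math import sqrt

# Hypothetical traces: each is the ordered list of step names an agent took.
TRACES = {
    "t1": ["plan", "search", "search", "answer"],
    "t2": ["plan", "search", "answer"],
    "t3": ["plan", "answer"],  # answered without calling the search tool
    "t4": ["plan", "answer"],
}


def fingerprint(steps, vocab):
    """Turn a trace into a fixed-length vector of step counts (assumed fingerprint)."""
    counts = Counter(steps)
    return [counts[name] for name in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(traces, threshold=0.9):
    """Greedy grouping: a trace joins the first cluster whose representative
    (the first trace placed in it) is at least `threshold` similar."""
    vocab = sorted({step for steps in traces.values() for step in steps})
    clusters = []  # list of (representative vector, [trace ids])
    for trace_id, steps in traces.items():
        vec = fingerprint(steps, vocab)
        for representative, members in clusters:
            if cosine(vec, representative) >= threshold:
                members.append(trace_id)
                break
        else:
            clusters.append((vec, [trace_id]))
    return [members for _, members in clusters]


groups = cluster(TRACES)
for members in groups:
    print(members)
```

In this toy data the traces that skip the search step fall into their own group, which is the kind of pocket of behavior a reviewer could then sample and inspect by hand.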

Chapters

1:00 The Hierarchy of Observability: Introduction to the layers of observability: telemetry for logging, monitoring for known signals, and analytics for discovering unknown unknowns.
5:20 Beyond Model Optimization: Shifting focus from squeezing marginal gains in benchmarks to ensuring reliability and trustworthiness in production environments.
9:00 Identifying Agent Anti-patterns: A look at real-world production failures, such as subtle hallucinations that standard evaluation suites fail to detect.
12:35 Clustering Behavior via Vector Fingerprints: Using high-dimensional distributions and stratified sampling to identify and taxonomize specific sub-patterns in agent traces.
17:05 Building the Data Flywheel: How production analytics can be used to recursively refine system prompts, create new evaluation dimensions, and drive continuous improvement.
21:05 Adaptive Analytics and the Feedback Loop: The importance of an adaptive approach to analytics that learns which signals matter most to the specific use case.
33:20 Practical Implementation and Tooling: Recommendations for instrumentation using OpenTelemetry and the GenAI semantic conventions to enable effective monitoring.
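
Illustration: the instrumentation recommended in the final chapter might start from span attributes like the ones below. This is a rough sketch, not code from the episode. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions as commonly documented, but those conventions are still evolving, so check the keys against the current specification before relying on them. The helper names, the model name, and the tool name are hypothetical, and plain dictionaries are used so the example runs without the OpenTelemetry SDK installed.

```python
def llm_call_attributes(model, input_tokens, output_tokens, operation="chat"):
    """Build span attributes for one model call.

    Keys are assumed to match the OpenTelemetry GenAI semantic conventions;
    verify them against the current spec.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def tool_call_attributes(tool_name):
    """Build span attributes for one tool execution (keys assumed, as above)."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }


# Hypothetical values for illustration only.
llm_attrs = llm_call_attributes("example-model", input_tokens=120, output_tokens=45)
tool_attrs = tool_call_attributes("search")

# With the OpenTelemetry SDK, each key/value pair would typically be attached
# to the active span via span.set_attribute(key, value).
for key, value in {**llm_attrs, **tool_attrs}.items():
    print(key, value)
```

Recording model calls and tool executions with consistent keys is what makes later steps possible, such as grouping traces by which tools were or were not called.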