Episode
How to Find the Agent Failures Your Evals Miss with Scott Clark - #767
- Published: May 7, 2026
- Duration seconds: 3199
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
Standard evaluations often fail to catch subtle, emergent failures in production LLM agents. This episode explores how to use post-production analytics and vector fingerprints to discover 'unknown unknowns' and close the loop between production behavior and model refinement.
Topics
- LLM Agents
- Observability
- Production AI
- Machine Learning Analytics
- Vector Embeddings
- Model Evaluation
- OpenTelemetry
- Agentic Workflows
Highlights
- Main idea: Implement a 'Maslow’s hierarchy of observability' moving from basic telemetry to advanced analytics for discovering unknown failure modes
- Failure mode: 'Lazy' tool-use hallucinations and subtle sub-distributions of behavior often bypass standard pre-production benchmarks
- Practical takeaway: Use vector fingerprints of agent traces to cluster behaviors and identify specific pockets of sub-optimal performance
- Practical takeaway: Integrate production analytics into a data flywheel to automatically generate new evals, guardrails, and fine-tuning datasets
- Technical approach: Leverage OpenTelemetry and GenAI semantic conventions to build a robust foundation for agent monitoring
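The vector-fingerprint takeaway above can be illustrated with a minimal sketch. It is not the method described in the episode: it assumes a fingerprint is just a normalized count of step types in a trace, and it swaps in a simple greedy cosine-similarity grouping for whatever clustering a real system would use. The `STEP_TYPES` vocabulary, the `fingerprint` and `cluster` helpers, the threshold, and the sample traces are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical vocabulary of step types; a production system would more
# likely derive fingerprints from learned embeddings of the full trace.
STEP_TYPES = ["llm_call", "tool_call", "tool_error", "retry", "final_answer"]


def fingerprint(trace):
    """Map a trace (a list of step-type strings) to a unit-length count vector."""
    counts = Counter(trace)
    vec = [float(counts[step]) for step in STEP_TYPES]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def cluster(fingerprints, threshold=0.9):
    """Greedy grouping: a trace joins the first cluster whose seed is similar enough."""
    clusters = []  # each entry is (seed_vector, [member indices])
    for i, fp in enumerate(fingerprints):
        for seed, members in clusters:
            if cosine(seed, fp) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((fp, [i]))
    return [members for _, members in clusters]


traces = [
    ["llm_call", "tool_call", "llm_call", "final_answer"],
    ["llm_call", "tool_call", "llm_call", "final_answer"],
    # Answers without ever calling a tool: a candidate "lazy" pattern.
    ["llm_call", "final_answer"],
    ["llm_call", "tool_call", "tool_error", "retry",
     "tool_call", "llm_call", "final_answer"],
]

groups = cluster([fingerprint(t) for t in traces])
print(groups)
```

Small or isolated groups, such as the trace that never calls a tool, are the pockets a reviewer would sample and label first.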
Chapters
- 1:00 The Hierarchy of Observability: Introduction to the layers of observability: telemetry for logging, monitoring for known signals, and analytics for discovering unknown unknowns.
- 5:20 Beyond Model Optimization: Shifting focus from squeezing marginal gains in benchmarks to ensuring reliability and trustworthiness in production environments.
- 9:00 Identifying Agent Anti-patterns: A look at real-world production failures, such as subtle hallucinations that standard evaluation suites fail to detect.
- 12:35 Clustering Behavior via Vector Fingerprints: Using high-dimensional distributions and stratified sampling to identify and taxonomize specific sub-patterns in agent traces.
- 17:05 Building the Data Flywheel: How production analytics can be used to recursively refine system prompts, create new evaluation dimensions, and drive continuous improvement.
- 21:05 Adaptive Analytics and the Feedback Loop: The importance of an adaptive approach to analytics that learns which signals matter most to the specific use case.
- 33:20 Practical Implementation and Tooling: Recommendations for instrumentation using OpenTelemetry and the GenAI semantic conventions to enable effective monitoring.
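As a rough illustration of the instrumentation recommendation in the final chapter, the sketch below builds the span name and attributes for a single model call. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as I understand them; those conventions are still evolving, so the names should be checked against the current specification. The `genai_span` helper, the model name, and the token counts are hypothetical, and the code deliberately uses plain dictionaries rather than the OpenTelemetry SDK.

```python
def genai_span(operation, model, input_tokens, output_tokens):
    """Return a (span_name, attributes) pair describing one model call.

    Attribute keys are assumed from the OpenTelemetry GenAI semantic
    conventions; verify them against the current spec before relying on them.
    """
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Assumed naming pattern: "<operation> <model>".
    span_name = f"{operation} {model}"
    return span_name, attributes


# Hypothetical values for illustration only.
name, attrs = genai_span("chat", "example-model", 412, 58)
print(name)
for key, value in attrs.items():
    print(f"  {key} = {value}")
```

With the `opentelemetry-api` package installed, the same pairs could be attached to a real span by opening it with `tracer.start_as_current_span(name)` and calling `span.set_attribute(key, value)` for each entry.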