Episode

How to Find the Agent Failures Your Evals Miss with Scott Clark - #767

Podcast
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Published
May 7, 2026
Duration seconds
3199
Processing state
processed
Canonical source
https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss
Audio
https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7498240745.mp3?updated=1778194521
JSON
/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767
Markdown
/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

Standard evaluations often fail to catch subtle, emergent failures in production LLM agents. This episode explores how to use post-production analytics and vector fingerprints to discover 'unknown unknowns' and close the loop between production behavior and model refinement.

Topics

  • LLM Agents
  • Observability
  • Production AI
  • Machine Learning Analytics
  • Vector Embeddings
  • Model Evaluation
  • OpenTelemetry
  • Agentic Workflows

Highlights

  • Main idea: Implement a 'Maslow’s hierarchy of observability' that moves from basic telemetry, through monitoring of known signals, to advanced analytics for discovering unknown failure modes
  • Failure mode: 'Lazy' tool-use hallucinations and subtle sub-distributions of behavior often bypass standard pre-production benchmarks
  • Practical takeaway: Use vector fingerprints of agent traces to cluster behaviors and identify specific pockets of sub-optimal performance
  • Practical takeaway: Integrate production analytics into a data flywheel to automatically generate new evals, guardrails, and fine-tuning datasets
  • Technical approach: Leverage OpenTelemetry and GenAI semantic conventions to build a robust foundation for agent monitoring
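
The clustering takeaway above can be sketched in a few lines. This is a minimal illustration, not the method described in the episode: the summary does not say how a vector fingerprint is computed, so the sketch assumes a normalized count of tool calls per trace, and it swaps in a simple greedy cosine-similarity grouping for whatever clustering is actually used. The tool names, the 0.9 threshold, and the sample traces and outcomes are all hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical tool vocabulary; a real system would derive this from its traces.
TOOLS = ["search", "fetch", "calculator", "answer"]


def fingerprint(trace):
    """Map a trace (a list of tool-call names) to a unit-length count vector."""
    counts = Counter(trace)
    vec = [float(counts[tool]) for tool in TOOLS]
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def cluster(fingerprints, threshold=0.9):
    """Greedy grouping: join the first cluster whose leader is similar enough."""
    leaders, labels = [], []
    for fp in fingerprints:
        for idx, leader in enumerate(leaders):
            if cosine(fp, leader) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing leader is close enough, so this trace starts a cluster.
            leaders.append(fp)
            labels.append(len(leaders) - 1)
    return labels


def success_rate_by_cluster(labels, outcomes):
    """Share of successful traces in each cluster."""
    totals, wins = Counter(), Counter()
    for label, ok in zip(labels, outcomes):
        totals[label] += 1
        wins[label] += int(ok)
    return {label: wins[label] / totals[label] for label in totals}


# Illustrative data only: two traces answer without calling any tool.
traces = [
    ["search", "fetch", "answer"],
    ["search", "fetch", "answer"],
    ["answer"],
    ["answer"],
    ["search", "fetch", "fetch", "answer"],
]
outcomes = [True, True, False, False, True]

labels = cluster([fingerprint(t) for t in traces])
rates = success_rate_by_cluster(labels, outcomes)
print(labels, rates)
```

In this toy data the traces that skip every tool land in their own cluster with a much lower success rate than the rest, which is the kind of pocket that an aggregate pass rate would hide. A production fingerprint would likely draw on richer signals than tool counts.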

Chapters

  1. 1:00 The Hierarchy of Observability: Introduction to the layers of observability: telemetry for logging, monitoring for known signals, and analytics for discovering unknown unknowns.
  2. 5:20 Beyond Model Optimization: Shifting focus from squeezing marginal gains in benchmarks to ensuring reliability and trustworthiness in production environments.
  3. 9:00 Identifying Agent Anti-patterns: A look at real-world production failures, such as subtle hallucinations that standard evaluation suites fail to detect.
  4. 12:35 Clustering Behavior via Vector Fingerprints: Using high-dimensional distributions and stratified sampling to identify and taxonomize specific sub-patterns in agent traces.
  5. 17:05 Building the Data Flywheel: How production analytics can be used to recursively refine system prompts, create new evaluation dimensions, and drive continuous improvement.
  6. 21:05 Adaptive Analytics and the Feedback Loop: The importance of an adaptive approach to analytics that learns which signals matter most to the specific use case.
  7. 33:20 Practical Implementation and Tooling: Recommendations for instrumentation using OpenTelemetry and the GenAI semantic conventions to enable effective monitoring.
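
As a rough illustration of the instrumentation recommendation in the last chapter, the sketch below builds span records whose names and attribute keys follow the OpenTelemetry GenAI semantic conventions as commonly documented (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.tool.name`). Those conventions are still evolving, so the keys should be checked against the current specification. To stay dependency-free, the spans here are plain dictionaries; the helper functions, model name, and tool name are hypothetical. With the OpenTelemetry Python SDK, the same keys would be passed to `span.set_attribute`.

```python
from collections import Counter


def chat_span(model, input_tokens, output_tokens):
    """Plain-dict stand-in for a model-call span (hypothetical helper)."""
    return {
        "name": f"chat {model}",
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }


def tool_span(tool_name):
    """Plain-dict stand-in for a tool-execution span (hypothetical helper)."""
    return {
        "name": f"execute_tool {tool_name}",
        "attributes": {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": tool_name,
        },
    }


def tool_call_counts(spans):
    """Count tool executions per tool name across one trace."""
    return Counter(
        s["attributes"]["gen_ai.tool.name"]
        for s in spans
        if s["attributes"]["gen_ai.operation.name"] == "execute_tool"
    )


def total_tokens(spans):
    """Sum input and output tokens over all model-call spans in one trace."""
    return sum(
        s["attributes"].get("gen_ai.usage.input_tokens", 0)
        + s["attributes"].get("gen_ai.usage.output_tokens", 0)
        for s in spans
    )


# One illustrative trace: a model call, two tool executions, a final model call.
trace = [
    chat_span("example-model", 120, 30),
    tool_span("search"),
    tool_span("search"),
    chat_span("example-model", 200, 45),
]

counts = tool_call_counts(trace)
tokens = total_tokens(trace)
print(dict(counts), tokens)
```

Recording every model call and tool execution under consistent keys is what makes later aggregation, such as per-trace tool counts or token totals, a matter of simple queries over the collected spans.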