# How to Find the Agent Failures Your Evals Miss with Scott Clark - #767

Page: https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767
Text version: https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md
Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
Published: 2026-05-07T22:46:00+00:00
Episode link: https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss
Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7498240745.mp3?updated=1778194521
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767
Duration seconds: 3199

## Resource

Standard evaluations often fail to catch subtle, emergent failures in production LLM agents. This episode explores how to use post-production analytics and vector fingerprints to discover 'unknown unknowns' and close the loop between production behavior and model refinement.

## Highlights
- Main idea: Implement a 'Maslow’s hierarchy of observability' that moves from basic telemetry, through monitoring of known signals, to advanced analytics for discovering unknown failure modes
- Failure mode: 'Lazy' tool-use hallucinations and subtle sub-distributions of behavior often bypass standard pre-production benchmarks
- Practical takeaway: Use vector fingerprints of agent traces to cluster behaviors and identify specific pockets of sub-optimal performance
- Practical takeaway: Integrate production analytics into a data flywheel to automatically generate new evals, guardrails, and fine-tuning datasets
- Technical approach: Leverage OpenTelemetry and GenAI semantic conventions to build a robust foundation for agent monitoring
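
The clustering takeaway above can be illustrated with a minimal sketch. This page does not say how the fingerprints are computed, so everything here is an assumption for illustration: the event vocabulary `VOCAB`, the count-based `fingerprint`, and the greedy threshold `cluster` are hypothetical stand-ins for whatever embedding and clustering a real system would use.

```python
import math
from collections import Counter

# Hypothetical vocabulary of trace events (an assumption, not from the episode).
# A real system would more likely use learned embeddings of the full trace.
VOCAB = ["search", "read_file", "write_file", "answer", "error"]

def fingerprint(trace):
    """Map a trace (list of event names) to a unit-length count vector over VOCAB."""
    counts = Counter(trace)
    vec = [float(counts[name]) for name in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def cluster(traces, threshold=0.8):
    """Greedy clustering: join the first cluster whose seed is similar enough,
    otherwise start a new cluster seeded by this trace."""
    clusters = []  # list of (seed_vector, [trace indices])
    for i, trace in enumerate(traces):
        vec = fingerprint(trace)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

traces = [
    ["search", "read_file", "answer"],
    ["search", "read_file", "read_file", "answer"],
    ["answer"],  # answered without calling any tool
    ["answer"],
    ["search", "error", "error", "answer"],
]
groups = cluster(traces)
print(groups)
```

With these toy traces, the two runs that answer without calling any tool fall into their own cluster, which is the kind of pocket a reviewer could then inspect and label.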

## Topics

LLM Agents, Observability, Production AI, Machine Learning Analytics, Vector Embeddings, Model Evaluation, OpenTelemetry, Agentic Workflows

## Chapters
- 1:00 — The Hierarchy of Observability: Introduction to the layers of observability: telemetry for logging, monitoring for known signals, and analytics for discovering unknown unknowns.
- 5:20 — Beyond Model Optimization: Shifting focus from squeezing marginal gains in benchmarks to ensuring reliability and trustworthiness in production environments.
- 9:00 — Identifying Agent Anti-patterns: A look at real-world production failures, such as subtle hallucinations that standard evaluation suites fail to detect.
- 12:35 — Clustering Behavior via Vector Fingerprints: Using high-dimensional distributions and stratified sampling to identify and taxonomize specific sub-patterns in agent traces.
- 17:05 — Building the Data Flywheel: How production analytics can be used to recursively refine system prompts, create new evaluation dimensions, and drive continuous improvement.
- 21:05 — Adaptive Analytics and the Feedback Loop: The importance of an adaptive approach to analytics that learns which signals matter most to the specific use case.
- 33:20 — Practical Implementation and Tooling: Recommendations for instrumentation using OpenTelemetry and the GenAI semantic conventions to enable effective monitoring.
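
As a rough illustration of the instrumentation chapter, the sketch below builds span attributes as plain dictionaries using attribute names from the OpenTelemetry GenAI semantic conventions (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.tool.name`). Those conventions are still evolving, so check the current specification before relying on the names. The helper functions, the model name, and the "no tool call" check are hypothetical and are not taken from the episode.

```python
def chat_span_attributes(model, input_tokens, output_tokens):
    """Attributes for a model-call span, keyed by GenAI semantic convention names."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

def tool_span_attributes(tool_name):
    """Attributes for a tool-execution span."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }

def answered_without_tools(spans):
    """Hypothetical check: True if a trace has model calls but no tool executions."""
    ops = [span.get("gen_ai.operation.name") for span in spans]
    return "chat" in ops and "execute_tool" not in ops

# "example-model" is a placeholder, not a real model identifier.
trace_with_tool = [
    chat_span_attributes("example-model", 120, 40),
    tool_span_attributes("search"),
    chat_span_attributes("example-model", 300, 80),
]
trace_without_tool = [chat_span_attributes("example-model", 120, 60)]

print(answered_without_tools(trace_with_tool))
print(answered_without_tools(trace_without_tool))
```

In a real deployment these dictionaries would be set on spans through an OpenTelemetry SDK rather than held in lists. Plain dictionaries keep the sketch dependency-free.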

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
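
A minimal sketch of how an agent might prepare the two actions above, using only the Python standard library. The URLs and HTTP methods come from this page. The function names are hypothetical, and the response format is not documented here, so the sketch only builds the requests and does not parse any reply.

```python
import urllib.request

BASE = "https://stenobird.com"
SLUG = "how-to-find-the-agent-failures-your-evals-miss-with-scott-clark-767"

def build_request_transcript():
    """Build (but do not send) the POST for the request_transcript action."""
    url = (
        f"{BASE}/v1/public/podcasts/twiml-ai-podcast/episodes/"
        f"{SLUG}/transcription-requests"
    )
    return urllib.request.Request(url, method="POST")

def build_read_markdown():
    """Build (but do not send) the GET for the read_markdown action."""
    url = f"{BASE}/podcast/twiml-ai-podcast/{SLUG}.md"
    return urllib.request.Request(url, method="GET")

post_req = build_request_transcript()
get_req = build_read_markdown()
print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

Passing either object to `urllib.request.urlopen` would send it. Because the page describes `request_transcript` as idempotent, repeating the POST should be safe, though any status codes or response body are assumptions until checked against the service.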

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.