Episode
It's 2026, and We're Still Talking Evals
- Podcast: MLOps.community
- Published: Apr 21, 2026
- Duration seconds: 2456
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
LLM evaluation is not a post-launch checklist but a foundational product strategy that must evolve from pre-production testing to production monitoring. Maggie Konstanty argues that true quality is found by aligning metrics with real user frustration and business outcomes rather than chasing arbitrary accuracy scores.
Topics
- LLM Evaluation
- MLOps
- AI Product Management
- LLM-as-a-judge
- Error Analysis
- AI Observability
- Regression Testing
- User Experience
Highlights
- Main idea: Evaluation must be part of the product DNA from day one, not an afterthought once the agent is shipped
- Failure mode: Trusting 'LLM-as-a-judge' accuracy scores without validating them against real-world user signals such as drop-off or frustration
- Practical takeaway: Use 'frustration signals'—such as users repeatedly asking for more information—as a proxy for agent failure
- Failure mode: The '20-evaluator trap,' where teams build complex, disconnected evaluators that don't map to core product goals
- Practical takeaway: Mature teams often revert to custom code and manual error analysis because off-the-shelf tools lack the necessary depth for specific business logic
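The 'frustration signals' takeaway above can be sketched in a few lines of custom code. This is a minimal illustration, not the method described in the episode: the `Turn` type, the marker phrases, and the threshold of two repeated follow-ups are all assumptions chosen for the example, and a real product would derive them from its own error analysis.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker phrases (an assumption for this sketch); a real team
# would mine these from its own transcripts during error analysis.
FOLLOW_UP_MARKERS = (
    "more information",
    "more detail",
    "can you elaborate",
    "that's not what i asked",
)


@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str


def count_frustration_signals(turns: List[Turn]) -> int:
    """Count user turns that contain at least one follow-up marker."""
    return sum(
        1
        for turn in turns
        if turn.role == "user"
        and any(marker in turn.text.lower() for marker in FOLLOW_UP_MARKERS)
    )


def is_frustrated(turns: List[Turn], threshold: int = 2) -> bool:
    """Flag a conversation once repeated follow-ups reach the threshold.

    The default threshold of 2 is an illustrative assumption.
    """
    return count_frustration_signals(turns) >= threshold
```

Conversations flagged this way are candidates for manual review, which is also one way to check whether an LLM judge's scores agree with what users actually experienced.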
Chapters
- 1:00 The Shift from Test Cases to Real Users: Why pre-production test cases fail to account for the unpredictable nature of real-world user queries.
- 4:10 The Risks of LLM-as-a-Judge: Discussing the limitations and potential for error when using LLMs to evaluate other LLMs.
- 10:10 Simulating User Scenarios: Using evaluations as a regression testing tool when introducing new features or prompt changes.
- 16:15 Measuring Trustworthiness vs. Unit Tests: Why standard software unit tests are insufficient for measuring the reliability of LLM behavior.
- 22:15 Leveraging Human-in-the-loop Signals: Using early user groups and feedback loops to identify failure modes before wide release.
- 25:20 Identifying Frustration and Drop-off: How to detect user dissatisfaction through interaction patterns and conversion data.
- 34:20 The Critique of Eval Tooling: Why current observability platforms struggle with sampling and the need for better error analysis interfaces.
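The regression-testing idea from the 10:10 chapter can be sketched as a comparison between a baseline agent and a candidate agent over a fixed set of simulated user scenarios. Everything here is an assumption made for illustration: the scenario format, the substring-based pass check, and the function names are hypothetical, and a real evaluator would likely use a richer pass criterion.

```python
from typing import Callable, Dict, List

# A scenario pairs a simulated user query with a phrase the answer must
# contain. This format is a hypothetical simplification for the sketch.
Scenario = Dict[str, str]
Agent = Callable[[str], str]


def passes(agent: Agent, scenario: Scenario) -> bool:
    """Return True if the agent's answer contains the required phrase."""
    answer = agent(scenario["query"])
    return scenario["must_contain"].lower() in answer.lower()


def find_regressions(
    baseline: Agent, candidate: Agent, scenarios: List[Scenario]
) -> List[str]:
    """List queries the baseline handled correctly but the candidate does not."""
    return [
        scenario["query"]
        for scenario in scenarios
        if passes(baseline, scenario) and not passes(candidate, scenario)
    ]
```

Running `find_regressions` before shipping a prompt change or new feature surfaces scenarios that used to pass and no longer do, which is a narrower and more actionable signal than a single aggregate score.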