Episode

It's 2026, and We're Still Talking Evals

Podcast: MLOps.community
Published: Apr 21, 2026
Duration seconds: 2456
Processing state: processed
Canonical source: https://podcasters.spotify.com/pod/show/mlops/episodes/Its-2026--and-Were-Still-Talking-Evals-e3i0pe1
Audio: https://anchor.fm/s/174cb1b8/podcast/play/118563713/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-16%2F422216432-44100-2-49c4d6cfff2c7.mp3
JSON: /v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals
Markdown: /podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md


Summary

LLM evaluation is not a post-launch checklist but a foundational product strategy that must carry over from pre-production testing into production monitoring. Maggie Konstanty argues that teams find real quality by tying their metrics to observed user frustration and business outcomes rather than chasing arbitrary accuracy scores.

Topics

LLM Evaluation
MLOps
AI Product Management
LLM-as-a-judge
Error Analysis
AI Observability
Regression Testing
User Experience

Highlights

Main idea: Evaluation must be part of the product DNA from day one, not an afterthought once the agent is shipped
Failure mode: Relying on 'LLM-as-a-judge' accuracy scores without validating them against real-world user signals such as drop-off or frustration
Practical takeaway: Use 'frustration signals', such as users repeatedly asking for more information, as a proxy for agent failure
Failure mode: The '20-evaluator trap,' where teams build complex, disconnected evaluators that don't map to core product goals
Practical takeaway: Mature teams often revert to custom code and manual error analysis because off-the-shelf tools lack the necessary depth for specific business logic
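
The frustration-signal and judge-validation points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method described in the episode: the marker phrases, the Conversation record, the judge_passed field, and the threshold of two follow-ups are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical marker phrases; a real product would derive its own from
# manual error analysis of production conversations.
FOLLOW_UP_MARKERS = ("more information", "more detail", "can you elaborate")


@dataclass
class Conversation:
    user_messages: list[str]
    judge_passed: bool  # verdict from an LLM judge, computed elsewhere


def follow_up_count(conv: Conversation) -> int:
    """Count user messages that ask for more information."""
    return sum(
        any(marker in message.lower() for marker in FOLLOW_UP_MARKERS)
        for message in conv.user_messages
    )


def is_frustrated(conv: Conversation, threshold: int = 2) -> bool:
    """Treat repeated follow-up requests as a frustration signal."""
    return follow_up_count(conv) >= threshold


def judge_signal_disagreements(convs: list[Conversation]) -> list[Conversation]:
    """Return conversations the judge passed but the user signal flags."""
    return [c for c in convs if c.judge_passed and is_frustrated(c)]
```

Conversations returned by judge_signal_disagreements are candidates for manual review, since they are cases where the judge's score and the user's behavior point in different directions. Substring matching is deliberately crude here; it only shows where such a signal would plug in.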

Chapters

1:00 The Shift from Test Cases to Real Users: Why pre-production test cases fail to account for the unpredictable nature of real-world user queries.
4:10 The Risks of LLM-as-a-Judge: The limitations and potential for error when using LLMs to evaluate other LLMs.
10:10 Simulating User Scenarios: Using evaluations as a regression testing tool when introducing new features or prompt changes.
16:15 Measuring Trustworthiness vs. Unit Tests: Why standard software unit tests are insufficient for measuring the reliability of LLM behavior.
22:15 Leveraging Human-in-the-loop Signals: Using early user groups and feedback loops to identify failure modes before wide release.
25:20 Identifying Frustration and Drop-off: How to detect user dissatisfaction through interaction patterns and conversion data.
34:20 The Critique of Eval Tooling: Why current observability platforms struggle with sampling and the need for better error analysis interfaces.
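
As a rough illustration of the regression-testing idea from the 10:10 chapter, the sketch below compares per-scenario eval results before and after a prompt or feature change. It assumes each scenario has already been reduced to a pass/fail result keyed by a scenario id; the function names and the data shape are hypothetical and do not come from the episode.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of scenarios that passed; 0.0 for an empty result set."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def find_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> list[str]:
    """Scenario ids that passed under the baseline but not the candidate.

    A scenario missing from the candidate run counts as a regression.
    """
    return sorted(
        scenario_id
        for scenario_id, passed in baseline.items()
        if passed and not candidate.get(scenario_id, False)
    )


# Example: results for the same three scenarios under two prompt versions.
baseline_run = {"refund-request": True, "order-status": True, "small-talk": False}
candidate_run = {"refund-request": True, "order-status": False, "small-talk": True}

regressed = find_regressions(baseline_run, candidate_run)
```

Listing regressed scenarios alongside the aggregate pass rate matters because the two runs above have the same pass rate even though one previously working scenario broke.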