# It's 2026, and We're Still Talking Evals

- Page: https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals
- Text version: https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md
- Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
- Published: 2026-04-21T17:00:00+00:00
- Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Its-2026--and-Were-Still-Talking-Evals-e3i0pe1
- Audio file: https://anchor.fm/s/174cb1b8/podcast/play/118563713/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-16%2F422216432-44100-2-49c4d6cfff2c7.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals
- Duration seconds: 2456

## Resource

LLM evaluation is not a post-launch checklist but a foundational product strategy that must evolve from pre-production testing to production monitoring. Maggie Konstanty argues that true quality is found by aligning metrics with real user frustration and business outcomes rather than chasing arbitrary accuracy scores.

## Highlights

- Main idea: Evaluation must be part of the product DNA from day one, not an afterthought once the agent is shipped
- Failure mode: Relying on 'LLM-as-a-judge' for accuracy without verifying it against real-world user signals like drop-off or frustration
- Practical takeaway: Use 'frustration signals', such as users repeatedly asking for more information, as a proxy for agent failure
- Failure mode: The '20-evaluator trap,' where teams build complex, disconnected evaluators that don't map to core product goals
- Practical takeaway: Mature teams often revert to custom code and manual error analysis because off-the-shelf tools lack the necessary depth for specific business logic

## Topics

LLM Evaluation, MLOps, AI Product Management, LLM-as-a-judge, Error Analysis, AI Observability, Regression Testing, User Experience

## Chapters

- 1:00 - The Shift from Test Cases to Real Users: Why pre-production test cases fail to account for the unpredictable nature of real-world user queries.
- 4:10 - The Risks of LLM-as-a-Judge: Discussing the limitations and potential for error when using LLMs to evaluate other LLMs.
- 10:10 - Simulating User Scenarios: Using evaluations as a regression testing tool when introducing new features or prompt changes.
- 16:15 - Measuring Trustworthiness vs. Unit Tests: Why standard software unit tests are insufficient for measuring the reliability of LLM behavior.
- 22:15 - Leveraging Human-in-the-loop Signals: Using early user groups and feedback loops to identify failure modes before wide release.
- 25:20 - Identifying Frustration and Drop-off: How to detect user dissatisfaction through interaction patterns and conversion data.
- 34:20 - The Critique of Eval Tooling: Why current observability platforms struggle with sampling and the need for better error analysis interfaces.
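
The "frustration signals" takeaway and the 25:20 chapter describe treating repeated requests for more information as a proxy for agent failure. The following is a minimal sketch of that idea, not the method discussed in the episode: the `Turn` type, the phrase patterns, and the threshold of 2 are all hypothetical and would need to come from a team's own error analysis.

```python
import re
from dataclasses import dataclass

# Hypothetical phrases treated as "asking for more information".
# These are illustrative assumptions, not patterns from the episode.
FOLLOW_UP_PATTERNS = [
    r"\bmore (info|information|details?)\b",
    r"\bcan you (elaborate|clarify|explain)\b",
    r"\bthat('s| is) not what i (asked|meant)\b",
]
_FOLLOW_UP_RE = re.compile("|".join(FOLLOW_UP_PATTERNS), re.IGNORECASE)


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


def count_follow_ups(turns):
    """Count user turns that match at least one follow-up pattern."""
    return sum(
        1 for t in turns if t.role == "user" and _FOLLOW_UP_RE.search(t.text)
    )


def is_frustrated(turns, threshold=2):
    """Flag a conversation once follow-up requests reach the threshold.

    The default threshold is an assumed value for illustration.
    """
    return count_follow_ups(turns) >= threshold


# Made-up conversation used only to exercise the functions above.
conversation = [
    Turn("user", "What is your refund policy?"),
    Turn("assistant", "We offer refunds."),
    Turn("user", "Can you give me more details?"),
    Turn("assistant", "Refunds are available."),
    Turn("user", "That's not what I asked, I need more information."),
]

print(is_frustrated(conversation))
```

Conversations flagged this way could be routed into manual error analysis rather than scored automatically, which keeps the signal tied to observed user behavior instead of a judge model's accuracy estimate.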

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals/transcription-requests`. Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md`. Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.