Episode

It's 2026, and We're Still Talking Evals

Podcast
MLOps.community
Published
Apr 21, 2026
Duration seconds
2456
Processing state
processed
Canonical source
https://podcasters.spotify.com/pod/show/mlops/episodes/Its-2026--and-Were-Still-Talking-Evals-e3i0pe1
Audio
https://anchor.fm/s/174cb1b8/podcast/play/118563713/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-16%2F422216432-44100-2-49c4d6cfff2c7.mp3
JSON
/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals
Markdown
/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md

Summary

LLM evaluation is not a post-launch checklist but a foundational product strategy that must evolve from pre-production testing to production monitoring. Maggie Konstanty argues that true quality is found by aligning metrics with real user frustration and business outcomes rather than chasing arbitrary accuracy scores.

Topics

  • LLM Evaluation
  • MLOps
  • AI Product Management
  • LLM-as-a-judge
  • Error Analysis
  • AI Observability
  • Regression Testing
  • User Experience

Highlights

  • Main idea: Evaluation must be part of the product DNA from day one, not an afterthought once the agent is shipped
  • Failure mode: Relying on 'LLM-as-a-judge' for accuracy without verifying it against real-world user signals like drop-off or frustration
  • Practical takeaway: Use 'frustration signals', such as users repeatedly asking for more information, as a proxy for agent failure
  • Failure mode: The '20-evaluator trap,' where teams build complex, disconnected evaluators that don't map to core product goals
  • Practical takeaway: Mature teams often revert to custom code and manual error analysis because off-the-shelf tools lack the necessary depth for specific business logic
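
As a rough illustration of the 'frustration signals' takeaway above, the sketch below counts user turns that look like a repeated request for more information. It is not taken from the episode: the phrase patterns, the function names (count_frustration_turns, is_frustrated), and the threshold of 2 are all assumptions made for the example. A real pattern list would likely come out of manual error analysis on production conversations.

```python
import re
from typing import Iterable

# Hypothetical phrases that may indicate the agent's answer did not land.
# These are illustrative assumptions, not a validated list.
FOLLOW_UP_PATTERNS = [
    r"\bmore (info|information|detail|details)\b",
    r"\bthat('s| is) not what i (asked|meant)\b",
    r"\bcan you (explain|clarify)\b",
    r"\bi (already|just) (said|asked)\b",
]
_FOLLOW_UP_RE = re.compile("|".join(FOLLOW_UP_PATTERNS), re.IGNORECASE)


def count_frustration_turns(user_turns: Iterable[str]) -> int:
    """Count user turns that match at least one follow-up pattern."""
    return sum(1 for turn in user_turns if _FOLLOW_UP_RE.search(turn))


def is_frustrated(user_turns: Iterable[str], threshold: int = 2) -> bool:
    """Flag a conversation once enough turns look like repeated asks.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return count_frustration_turns(user_turns) >= threshold


conversation = [
    "What is your refund policy?",
    "Can you give me more information?",
    "I already asked for more details on refunds.",
]
print(is_frustrated(conversation))
```

Flagged conversations are candidates for manual review rather than a verdict; the point is to have a production signal against which an LLM-as-a-judge score can be compared.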

Chapters

  1. 1:00 The Shift from Test Cases to Real Users: Why pre-production test cases fail to account for the unpredictable nature of real-world user queries.
  2. 4:10 The Risks of LLM-as-a-Judge: Discussing the limitations and potential for error when using LLMs to evaluate other LLMs.
  3. 10:10 Simulating User Scenarios: Using evaluations as a regression testing tool when introducing new features or prompt changes.
  4. 16:15 Measuring Trustworthiness vs. Unit Tests: Why standard software unit tests are insufficient for measuring the reliability of LLM behavior.
  5. 22:15 Leveraging Human-in-the-loop Signals: Using early user groups and feedback loops to identify failure modes before wide release.
  6. 25:20 Identifying Frustration and Drop-off: How to detect user dissatisfaction through interaction patterns and conversion data.
  7. 34:20 The Critique of Eval Tooling: Why current observability platforms struggle with sampling and the need for better error analysis interfaces.
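
Chapter 3 describes using evaluations as a regression test when a prompt or feature changes. A minimal sketch of that idea, under stated assumptions, follows: the Scenario type, the substring-based pass criterion, and the toy agents are hypothetical stand-ins, not anything described in the episode. A real setup would call the actual agent and use whatever pass criteria the team has validated.

```python
from dataclasses import dataclass
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical: maps a user message to a reply


@dataclass(frozen=True)
class Scenario:
    name: str
    user_input: str
    must_contain: str  # simplistic pass criterion, assumed for this sketch


def passes(agent: Agent, scenario: Scenario) -> bool:
    """A scenario passes if the reply contains the expected text."""
    return scenario.must_contain.lower() in agent(scenario.user_input).lower()


def find_regressions(
    baseline: Agent, candidate: Agent, scenarios: List[Scenario]
) -> List[str]:
    """Return names of scenarios the baseline passes but the candidate fails."""
    return [
        s.name
        for s in scenarios
        if passes(baseline, s) and not passes(candidate, s)
    ]


# Toy agents standing in for two prompt versions.
def old_prompt_agent(message: str) -> str:
    return "Refunds are available within 30 days. Shipping takes 5 days."


def new_prompt_agent(message: str) -> str:
    return "Shipping takes 5 days."


scenarios = [
    Scenario("refund_window", "Can I get a refund?", "30 days"),
    Scenario("shipping_time", "How long is shipping?", "5 days"),
]
print(find_regressions(old_prompt_agent, new_prompt_agent, scenarios))
```

Comparing scenario by scenario, rather than only comparing aggregate pass rates, keeps a newly broken case from being hidden by an unrelated improvement elsewhere.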