# It's 2026, and We're Still Talking Evals

Page: https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals
Text version: https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md
Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
Published: 2026-04-21T17:00:00+00:00
Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Its-2026--and-Were-Still-Talking-Evals-e3i0pe1
Audio file: https://anchor.fm/s/174cb1b8/podcast/play/118563713/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-16%2F422216432-44100-2-49c4d6cfff2c7.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals
Duration seconds: 2456

## Resource

LLM evaluation is not a post-launch checklist but a foundational product strategy that must evolve from pre-production testing to production monitoring. Maggie Konstanty argues that true quality is found by aligning metrics with real user frustration and business outcomes rather than chasing arbitrary accuracy scores.

## Highlights

- Main idea: Evaluation must be part of the product DNA from day one, not an afterthought once the agent is shipped
- Failure mode: Relying on 'LLM-as-a-judge' for accuracy without verifying it against real-world user signals like drop-off or frustration
- Practical takeaway: Use 'frustration signals'—such as users repeatedly asking for more information—as a proxy for agent failure
- Failure mode: The '20-evaluator trap,' where teams build complex, disconnected evaluators that don't map to core product goals
- Practical takeaway: Mature teams often revert to custom code and manual error analysis because off-the-shelf tools lack the necessary depth for specific business logic
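
The frustration-signal and judge-verification points above can be sketched in a few lines of Python. This is a minimal illustration, not the method described in the episode: the phrase patterns, the `Conversation` shape, the `judge_passed` field, and the threshold of two requests are all assumptions made for the example and would need tuning against real conversation logs.

```python
import re
from dataclasses import dataclass

# Hypothetical phrases treated as "asking for more information".
# These are illustrative assumptions, not patterns taken from the episode.
MORE_INFO_PATTERNS = [
    r"\bmore (info|information|detail|details)\b",
    r"\bcan you (explain|elaborate|clarify)\b",
    r"\bthat (doesn't|does not|didn't) (help|answer)\b",
    r"\bwhat do you mean\b",
]
_MORE_INFO_RE = re.compile("|".join(MORE_INFO_PATTERNS), re.IGNORECASE)


@dataclass
class Conversation:
    user_turns: list[str]
    # Verdict from an LLM-as-a-judge evaluator (assumed to exist upstream).
    judge_passed: bool


def count_more_info_requests(user_turns: list[str]) -> int:
    """Count user turns that look like a request for more information."""
    return sum(1 for turn in user_turns if _MORE_INFO_RE.search(turn))


def is_frustrated(conv: Conversation, threshold: int = 2) -> bool:
    """Flag a conversation once the user has asked for more info `threshold` times."""
    return count_more_info_requests(conv.user_turns) >= threshold


def judge_signal_disagreements(
    convs: list[Conversation], threshold: int = 2
) -> list[Conversation]:
    """Return conversations the judge passed even though the user looked frustrated."""
    return [c for c in convs if c.judge_passed and is_frustrated(c, threshold)]
```

Conversations returned by `judge_signal_disagreements` are natural candidates for manual error analysis, since they are cases where the automated verdict and the user's behavior point in different directions.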

## Topics

LLM Evaluation, MLOps, AI Product Management, LLM-as-a-judge, Error Analysis, AI Observability, Regression Testing, User Experience

## Chapters

- 1:00 — The Shift from Test Cases to Real Users: Why pre-production test cases fail to account for the unpredictable nature of real-world user queries.
- 4:10 — The Risks of LLM-as-a-Judge: Discussing the limitations and potential for error when using LLMs to evaluate other LLMs.
- 10:10 — Simulating User Scenarios: Using evaluations as a regression testing tool when introducing new features or prompt changes.
- 16:15 — Measuring Trustworthiness vs. Unit Tests: Why standard software unit tests are insufficient for measuring the reliability of LLM behavior.
- 22:15 — Leveraging Human-in-the-loop Signals: Using early user groups and feedback loops to identify failure modes before wide release.
- 25:20 — Identifying Frustration and Drop-off: How to detect user dissatisfaction through interaction patterns and conversion data.
- 34:20 — The Critique of Eval Tooling: Why current observability platforms struggle with sampling and the need for better error analysis interfaces.
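
As a rough illustration of the regression-testing idea from the 10:10 chapter, the sketch below replays a fixed set of simulated user scenarios against a baseline agent and a candidate agent (for example, before and after a prompt change). The `Scenario` type, the function names, and the idea of a simple boolean check per scenario are assumptions for this example, not details from the episode.

```python
from typing import Callable

# An agent is modeled as a function from a user query to a reply (an assumption).
Agent = Callable[[str], str]
# Each scenario pairs a simulated user query with a pass/fail check on the reply.
Scenario = tuple[str, Callable[[str], bool]]


def pass_rate(agent: Agent, scenarios: list[Scenario]) -> float:
    """Fraction of scenarios whose check passes for this agent."""
    results = [check(agent(query)) for query, check in scenarios]
    return sum(results) / len(results)


def regressions(
    baseline: Agent, candidate: Agent, scenarios: list[Scenario]
) -> list[str]:
    """Queries the baseline handled correctly but the candidate does not."""
    return [
        query
        for query, check in scenarios
        if check(baseline(query)) and not check(candidate(query))
    ]
```

Listing the individual regressed queries, rather than reporting only the aggregate pass rate, keeps the output tied to specific user scenarios that someone can inspect by hand.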

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
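
For agents calling these actions from Python, a minimal standard-library sketch might look like the following. The two URLs and HTTP methods come from the action list above; everything else is an assumption, including the empty POST body, the absence of authentication headers, and the helper names. The response format is not documented on this page, so `send` simply returns the status code and raw body.

```python
import urllib.request

# URLs copied from the Actions list on this page.
EPISODE_MD_URL = (
    "https://stenobird.com/podcast/mlops-community/"
    "it-s-2026-and-we-re-still-talking-evals.md"
)
TRANSCRIPT_REQUEST_URL = (
    "https://stenobird.com/v1/public/podcasts/mlops-community/episodes/"
    "it-s-2026-and-we-re-still-talking-evals/transcription-requests"
)


def build_read_markdown_request() -> urllib.request.Request:
    """Build the GET request for the Markdown representation (read_markdown)."""
    return urllib.request.Request(EPISODE_MD_URL, method="GET")


def build_transcript_request() -> urllib.request.Request:
    """Build the POST request that asks for transcription (request_transcript)."""
    # Empty body: this page does not document a payload, so none is assumed.
    return urllib.request.Request(TRANSCRIPT_REQUEST_URL, data=b"", method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, response body as text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling `send(build_transcript_request())` would perform the explicit transcription request; because the action is described as idempotent, repeating the call should be safe, though the exact response for a repeated request is not specified here.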

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.