{"podcast":{"title":"MLOps.community","slug":"mlops-community","podcast_index_feed_id":28679,"rss_url":"https://anchor.fm/s/174cb1b8/podcast/rss","website_url":"https://mlops.community","image_url":"https://d3t3ozftmdmh3i.cloudfront.net/production/podcast_uploaded_nologo/3809022/3809022-1612190855115-e91f8b881173f.jpg","author":"Demetrios","episode_count":516,"summary":"Relaxed Conversations around getting AI into production, whatever shape that may come in (agentic, traditional ML, LLMs, Vibes, etc)","last_synced_at":null,"page_url":"https://stenobird.com/podcast/mlops-community"},"episode":{"title":"It's 2026, and We're Still Talking Evals","slug":"it-s-2026-and-we-re-still-talking-evals","published_at":"2026-04-21T17:00:00+00:00","page_url":"https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals","show_page_url":"https://stenobird.com/podcast/mlops-community","url":"https://podcasters.spotify.com/pod/show/mlops/episodes/Its-2026--and-Were-Still-Talking-Evals-e3i0pe1","audio_url":"https://anchor.fm/s/174cb1b8/podcast/play/118563713/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-16%2F422216432-44100-2-49c4d6cfff2c7.mp3","summary":"LLM evaluation is not a post-launch checklist but a foundational product strategy that must evolve from pre-production testing to production monitoring. Maggie Konstanty argues that true quality is found by aligning metrics with real user frustration and business outcomes rather than chasing arbitrary accuracy scores.","meta_description":"Learn why accuracy metrics lie and how to build a robust LLM evaluation strategy that captures real user behavior and prevents production regressions.","key_points":["Main idea: Evaluation must be part of the product DNA from day one, not an afterthought once the agent is shipped","Failure mode: Relying on 'LLM-as-a-judge' for accuracy without verifying it against real-world user signals like drop-off or frustration","Practical takeaway: Use 'frustration signals'—such as users repeatedly asking for more information—as a proxy for agent failure","Failure mode: The '20-evaluator trap,' where teams build complex, disconnected evaluators that don't map to core product goals","Practical takeaway: Mature teams often revert to custom code and manual error analysis because off-the-shelf tools lack the necessary depth for specific business logic"],"chapters":[{"start_ms":60000,"title":"The Shift from Test Cases to Real Users","summary":"Why pre-production test cases fail to account for the unpredictable nature of real-world user queries."},{"start_ms":250000,"title":"The Risks of LLM-as-a-Judge","summary":"Discussing the limitations and potential for error when using LLMs to evaluate other LLMs."},{"start_ms":610000,"title":"Simulating User Scenarios","summary":"Using evaluations as a regression testing tool when introducing new features or prompt changes."},{"start_ms":975000,"title":"Measuring Trustworthiness vs. Unit Tests","summary":"Why standard software unit tests are insufficient for measuring the reliability of LLM behavior."},{"start_ms":1335000,"title":"Leveraging Human-in-the-loop Signals","summary":"Using early user groups and feedback loops to identify failure modes before wide release."},{"start_ms":1520000,"title":"Identifying Frustration and Drop-off","summary":"How to detect user dissatisfaction through interaction patterns and conversion data."},{"start_ms":2060000,"title":"The Critique of Eval Tooling","summary":"Why current observability platforms struggle with sampling and the need for better error analysis interfaces."}],"topics":["LLM Evaluation","MLOps","AI Product Management","LLM-as-a-judge","Error Analysis","AI Observability","Regression Testing","User Experience"],"duration_seconds":2456,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/mlops-community/episodes/it-s-2026-and-we-re-still-talking-evals/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/mlops-community/it-s-2026-and-we-re-still-talking-evals.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}