Episode

Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Podcast: MLOps.community
Published: Feb 19, 2026
Duration seconds: 3955
Processing state: processed
Canonical source: https://podcasters.spotify.com/pod/show/mlops/episodes/Serving-LLMs-in-Production-Performance--Cost--Scale--CAST-AI-Roundtable-e3fak28
Audio: https://anchor.fm/s/174cb1b8/podcast/play/115740168/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-19%2F418421040-44100-2-e2d499a23fae4.mp3
JSON: /v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable
Markdown: /podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md

Actions

POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

Moving LLMs from prototype to production requires balancing the conflicting demands of latency, throughput, and cost. This roundtable explores how to navigate the spectrum between managed APIs and self-hosted infrastructure using data-driven SLOs.

Topics

LLM Inference
MLOps
GPU Optimization
Model Quantization
Infrastructure Scaling
Cloud Cost Management
Service Level Objectives
Machine Learning Engineering

Highlights

Main idea: Avoid the binary trap of only using APIs or only self-hosting; treat deployment as a spectrum based on specific workload needs
Practical takeaway: Design for optionality so you can migrate between infrastructure tiers without rewriting your entire application
Failure mode: Overlooking concurrency limits and KV cache requirements, which can lead to unexpected latency spikes and wasted GPU capacity
Main idea: Use Service Level Objectives (SLOs) as the foundation for all infrastructure decisions, starting with an explicit definition of 'goodput'
Practical takeaway: Use quantization and batching strategically to trade off latency for higher throughput and lower cost per token
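
To make the KV cache failure mode above concrete, here is a rough sizing sketch. It uses the standard per-token KV cache formula for transformer attention (keys plus values, per layer, per KV head). The model dimensions and the 40 GiB memory budget are hypothetical values chosen for illustration, not figures from the episode.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache needed for one token of one sequence."""
    # Both keys and values are cached at every layer, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_concurrent_sequences(memory_budget_bytes, tokens_per_sequence, bytes_per_token):
    """Worst-case concurrency if every sequence grows to its full length."""
    return memory_budget_bytes // (tokens_per_sequence * bytes_per_token)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, 16-bit cache.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2
)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Hypothetical budget: 40 GiB left for the cache after loading the weights.
print(max_concurrent_sequences(40 * GIB, tokens_per_sequence=4096, bytes_per_token=per_token))  # 20
```

Under these assumptions, halving `bytes_per_element` (an 8-bit cache) or shortening the maximum sequence length doubles the number of sequences that fit, which is one way a concurrency limit can be estimated before it shows up as a latency spike.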

Chapters

6:00 The Deployment Spectrum: Moving beyond the binary choice between third-party APIs and full self-hosting to find the right balance for your specific use case.
16:00 Designing for Movement: Why engineering for optionality prevents costly application rewrites when scaling from prototype to production.
20:55 Infrastructure Complexity: Understanding how data restrictions, quotas, and GPU capacity impact the underlying infrastructure decisions.
30:50 Latency vs. Throughput: Analyzing the trade-offs between Time to First Token (TTFT) and end-to-end request latency in generative workloads.
35:50 Optimizing GPU Utilization: How to balance memory bandwidth, batch sizes, and quantization to minimize cost per token without sacrificing performance.
40:45 The Impact of Quantization: A deep dive into how reducing the numerical precision of model weights affects memory movement and hardware-level computation efficiency.
50:40 The Importance of SLOs: Why defining clear Service Level Objectives and 'goodput' is the most critical step in managing LLM workloads.
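
As an illustration of the 'goodput' idea, the sketch below counts only the requests that meet both a TTFT target and an end-to-end latency target. This is one common way to define goodput and is an assumption here; this summary does not give the episode's exact definition. The `RequestTiming` type, the SLO thresholds, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    ttft_s: float  # time to first token, in seconds
    e2e_s: float   # end-to-end request latency, in seconds


def goodput(requests, window_s, ttft_slo_s, e2e_slo_s):
    """Requests per second that met both SLO targets within the window."""
    good = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.e2e_s <= e2e_slo_s
    )
    return good / window_s


# Hypothetical timings for four requests observed in a 2-second window.
sample = [
    RequestTiming(ttft_s=0.2, e2e_s=3.0),   # meets both targets
    RequestTiming(ttft_s=0.9, e2e_s=4.0),   # misses the TTFT target
    RequestTiming(ttft_s=0.3, e2e_s=12.0),  # misses the end-to-end target
    RequestTiming(ttft_s=0.4, e2e_s=5.0),   # meets both targets
]

print(len(sample) / 2.0)  # raw throughput: 2.0 requests/s
print(goodput(sample, window_s=2.0, ttft_slo_s=0.5, e2e_slo_s=10.0))  # 1.0 requests/s
```

In this toy sample, raw throughput is twice the goodput, which shows how a throughput number alone can overstate useful capacity when batching pushes some requests past their latency targets.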