# Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Page: https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable
Text version: https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md
Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
Published: 2026-02-19T18:00:02+00:00
Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Serving-LLMs-in-Production-Performance--Cost--Scale--CAST-AI-Roundtable-e3fak28
Audio file: https://anchor.fm/s/174cb1b8/podcast/play/115740168/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-19%2F418421040-44100-2-e2d499a23fae4.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable
Duration seconds: 3955

## Resource

Moving LLMs from prototype to production requires balancing the conflicting demands of latency, throughput, and cost. This roundtable explores how to navigate the spectrum between managed APIs and self-hosted infrastructure using data-driven SLOs.
## Highlights

- Main idea: Avoid the binary trap of only using APIs or only self-hosting; treat deployment as a spectrum based on specific workload needs
- Practical takeaway: Design for optionality so you can migrate between infrastructure tiers without rewriting your entire application
- Failure mode: Overlooking concurrency limits and KV cache requirements, which can lead to unexpected latency spikes and wasted GPU capacity
- Main idea: Use Service Level Objectives (SLOs) as the foundation for all infrastructure decisions, specifically defining 'goodput'
- Practical takeaway: Use quantization and batching strategically to trade off latency for higher throughput and lower cost per token

## Topics

LLM Inference, MLOps, GPU Optimization, Model Quantization, Infrastructure Scaling, Cloud Cost Management, Service Level Objectives, Machine Learning Engineering

## Chapters

- 6:00 - The Deployment Spectrum: Moving beyond the binary choice between third-party APIs and full self-hosting to find the right balance for your specific use case.
- 16:00 - Designing for Movement: Why engineering for optionality prevents costly application rewrites when scaling from prototype to production.
- 20:55 - Infrastructure Complexity: Understanding how data restrictions, quotas, and GPU capacity impact the underlying infrastructure decisions.
- 30:50 - Latency vs. Throughput: Analyzing the trade-offs between Time to First Token (TTFT) and end-to-end request latency in generative workloads.
- 35:50 - Optimizing GPU Utilization: How to balance memory bandwidth, batch sizes, and quantization to minimize cost per token without sacrificing performance.
- 40:45 - The Impact of Quantization: A deep dive into how reducing model weights affects memory movement and hardware-level computation efficiency.
- 50:40 - The Importance of SLOs: Why defining clear Service Level Objectives and 'goodput' is the most critical step in managing LLM workloads.
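
The highlights and the 50:40 chapter refer to 'goodput' without defining it. A minimal sketch follows, assuming the common reading that goodput is the rate of requests that both succeed and meet every latency SLO, in contrast to raw throughput, which counts every request. The `RequestRecord` type, the `goodput` function, and the threshold values are hypothetical names and numbers chosen for illustration; they are not taken from the episode.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestRecord:
    """One completed request (hypothetical record shape for this sketch)."""
    ttft_s: float           # time to first token, in seconds
    e2e_s: float            # end-to-end request latency, in seconds
    succeeded: bool = True  # False if the request errored or was dropped


def goodput(records: Iterable[RequestRecord], window_s: float,
            ttft_slo_s: float, e2e_slo_s: float) -> float:
    """Requests per second that succeeded AND met both latency SLOs."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good = sum(
        1 for r in records
        if r.succeeded and r.ttft_s <= ttft_slo_s and r.e2e_s <= e2e_slo_s
    )
    return good / window_s


# Illustrative sample: five requests observed over a 10-second window.
records = [
    RequestRecord(ttft_s=0.2, e2e_s=2.0),                   # meets both SLOs
    RequestRecord(ttft_s=0.9, e2e_s=2.5),                   # misses the TTFT SLO
    RequestRecord(ttft_s=0.3, e2e_s=6.0),                   # misses the end-to-end SLO
    RequestRecord(ttft_s=0.1, e2e_s=1.0, succeeded=False),  # failed request
    RequestRecord(ttft_s=0.4, e2e_s=3.0),                   # meets both SLOs
]

window_s = 10.0
throughput = len(records) / window_s  # counts every request
good = goodput(records, window_s, ttft_slo_s=0.5, e2e_slo_s=5.0)
print(f"throughput={throughput:.2f} req/s, goodput={good:.2f} req/s")
```

In this sample, only two of the five requests count toward goodput, so a system can look healthy on throughput while missing its SLOs for most traffic. Which thresholds to use, and whether to add others such as per-token latency, depends on the workload.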
## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable/transcription-requests`: Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md`: Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.