{"podcast":{"title":"MLOps.community","slug":"mlops-community","podcast_index_feed_id":28679,"rss_url":"https://anchor.fm/s/174cb1b8/podcast/rss","website_url":"https://mlops.community","image_url":"https://d3t3ozftmdmh3i.cloudfront.net/production/podcast_uploaded_nologo/3809022/3809022-1612190855115-e91f8b881173f.jpg","author":"Demetrios","episode_count":516,"summary":"Relaxed Conversations around getting AI into production, whatever shape that may come in (agentic, traditional ML, LLMs, Vibes, etc)","last_synced_at":null,"page_url":"https://stenobird.com/podcast/mlops-community"},"episode":{"title":"Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable","slug":"serving-llms-in-production-performance-cost-scale-cast-ai-roundtable","published_at":"2026-02-19T18:00:02+00:00","page_url":"https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable","show_page_url":"https://stenobird.com/podcast/mlops-community","url":"https://podcasters.spotify.com/pod/show/mlops/episodes/Serving-LLMs-in-Production-Performance--Cost--Scale--CAST-AI-Roundtable-e3fak28","audio_url":"https://anchor.fm/s/174cb1b8/podcast/play/115740168/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-19%2F418421040-44100-2-e2d499a23fae4.mp3","summary":"Moving LLMs from prototype to production requires balancing the conflicting demands of latency, throughput, and cost. This roundtable explores how to navigate the spectrum between managed APIs and self-hosted infrastructure using data-driven SLOs.","meta_description":"Learn how to optimize LLM inference in production by balancing GPU utilization, quantization, and cost-effective infrastructure scaling.","key_points":["Main idea: Avoid the binary trap of only using APIs or only self-hosting; treat deployment as a spectrum based on specific workload needs","Practical takeaway: Design for optionality so you can migrate between infrastructure tiers without rewriting your entire application","Failure mode: Overlooking concurrency limits and KV cache requirements, which can lead to unexpected latency spikes and wasted GPU capacity","Main idea: Use Service Level Objectives (SLOs) as the foundation for all infrastructure decisions, specifically defining 'goodput'","Practical takeaway: Use quantization and batching strategically to trade off latency for higher throughput and lower cost per token"],"chapters":[{"start_ms":360000,"title":"The Deployment Spectrum","summary":"Moving beyond the binary choice between third-party APIs and full self-hosting to find the right balance for your specific use case."},{"start_ms":960000,"title":"Designing for Movement","summary":"Why engineering for optionality prevents costly application rewrites when scaling from prototype to production."},{"start_ms":1255000,"title":"Infrastructure Complexity","summary":"Understanding how data restrictions, quotas, and GPU capacity impact the underlying infrastructure decisions."},{"start_ms":1850000,"title":"Latency vs. Throughput","summary":"Analyzing the trade-offs between Time to First Token (TTFT) and end-to-end request latency in generative workloads."},{"start_ms":2150000,"title":"Optimizing GPU Utilization","summary":"How to balance memory bandwidth, batch sizes, and quantization to minimize cost per token without sacrificing performance."},{"start_ms":2445000,"title":"The Impact of Quantization","summary":"A deep dive into how reducing model weights affects memory movement and hardware-level computation efficiency."},{"start_ms":3040000,"title":"The Importance of SLOs","summary":"Why defining clear Service Level Objectives and 'goodput' is the most critical step in managing LLM workloads."}],"topics":["LLM Inference","MLOps","GPU Optimization","Model Quantization","Infrastructure Scaling","Cloud Cost Management","Service Level Objectives","Machine Learning Engineering"],"duration_seconds":3955,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}