Episode
Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable
- Podcast
- MLOps.community
- Published
- Feb 19, 2026
- Duration seconds
- 3955
- Processing state
processed
Actions
POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
Moving LLMs from prototype to production requires balancing the conflicting demands of latency, throughput, and cost. This roundtable explores how to navigate the spectrum between managed APIs and self-hosted infrastructure using data-driven SLOs.
Topics
- LLM Inference
- MLOps
- GPU Optimization
- Model Quantization
- Infrastructure Scaling
- Cloud Cost Management
- Service Level Objectives
- Machine Learning Engineering
Highlights
- Main idea: Avoid the binary trap of only using APIs or only self-hosting; treat deployment as a spectrum based on specific workload needs
- Practical takeaway: Design for optionality so you can migrate between infrastructure tiers without rewriting your entire application
- Failure mode: Overlooking concurrency limits and KV cache requirements, which can lead to unexpected latency spikes and wasted GPU capacity
- Main idea: Use Service Level Objectives (SLOs) as the foundation for all infrastructure decisions, and explicitly define 'goodput' (the share of throughput that actually meets those SLOs)
- Practical takeaway: Use quantization and batching strategically to trade off latency for higher throughput and lower cost per token
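The goodput idea in the highlights above can be sketched in a few lines. This summary does not give the roundtable's exact definition, so the sketch assumes a common one: a request counts as "good" only if it meets both a Time to First Token (TTFT) target and an end-to-end latency target. The `RequestRecord` type, the `goodput` function, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One served request (hypothetical record shape for this sketch)."""
    ttft_s: float  # time to first token, in seconds
    e2e_s: float   # end-to-end request latency, in seconds


def goodput(records, window_s, ttft_slo_s, e2e_slo_s):
    """Return (good requests per second, fraction of requests that were good).

    A request is "good" only if it meets BOTH latency targets. This is an
    assumed definition, not one confirmed by the episode.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good = sum(
        1 for r in records
        if r.ttft_s <= ttft_slo_s and r.e2e_s <= e2e_slo_s
    )
    fraction = good / len(records) if records else 0.0
    return good / window_s, fraction


# Four requests observed over a 2-second window (made-up numbers).
records = [
    RequestRecord(ttft_s=0.2, e2e_s=2.0),  # meets both targets
    RequestRecord(ttft_s=0.9, e2e_s=2.5),  # misses the TTFT target
    RequestRecord(ttft_s=0.3, e2e_s=6.0),  # misses the end-to-end target
    RequestRecord(ttft_s=0.4, e2e_s=3.0),  # meets both targets
]
rate, fraction = goodput(records, window_s=2.0, ttft_slo_s=0.5, e2e_slo_s=5.0)
print(f"goodput: {rate:.2f} req/s, {fraction:.0%} of requests within SLO")
```

Raw throughput here is 2 requests per second, but only half of those requests meet the targets, which is the gap a goodput metric is meant to expose.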
Chapters
- 6:00 The Deployment Spectrum: Moving beyond the binary choice between third-party APIs and full self-hosting to find the right balance for your specific use case.
- 16:00 Designing for Movement: Why engineering for optionality prevents costly application rewrites when scaling from prototype to production.
- 20:55 Infrastructure Complexity: Understanding how data restrictions, quotas, and GPU capacity impact the underlying infrastructure decisions.
- 30:50 Latency vs. Throughput: Analyzing the trade-offs between Time to First Token (TTFT) and end-to-end request latency in generative workloads.
- 35:50 Optimizing GPU Utilization: How to balance memory bandwidth, batch sizes, and quantization to minimize cost per token without sacrificing performance.
- 40:45 The Impact of Quantization: A deep dive into how reducing model weights affects memory movement and hardware-level computation efficiency.
- 50:40 The Importance of SLOs: Why defining clear Service Level Objectives and 'goodput' is the most critical step in managing LLM workloads.
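
The chapters on GPU utilization and quantization come down to some back-of-the-envelope arithmetic, sketched below. It is a rough illustration, not the panel's method: it uses the standard estimate of KV cache size per token (one key and one value vector per layer per KV head) and ignores activations, fragmentation, and runtime overhead. The model shape, GPU size, hourly price, and throughput figure are all hypothetical.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate memory needed to hold the model weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache footprint of one token across all layers.

    The factor 2 covers one key and one value vector per layer per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(gpu_bytes, n_params, bits_per_weight, per_token_bytes):
    """Tokens of KV cache that fit after the weights are loaded (upper bound)."""
    free = gpu_bytes - weight_bytes(n_params, bits_per_weight)
    return max(0, int(free // per_token_bytes))


def usd_per_million_tokens(gpu_hourly_usd, tokens_per_s):
    """Cost per million generated tokens at a sustained throughput."""
    return gpu_hourly_usd / (tokens_per_s * 3600) * 1_000_000


# Hypothetical 8B-parameter model on a hypothetical 80 GiB GPU.
GPU_BYTES = 80 * 1024**3
N_PARAMS = 8_000_000_000
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

tokens_16bit = max_cached_tokens(GPU_BYTES, N_PARAMS, 16, per_token)
tokens_4bit = max_cached_tokens(GPU_BYTES, N_PARAMS, 4, per_token)
cost = usd_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_s=1000)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"cacheable tokens with 16-bit weights: {tokens_16bit}")
print(f"cacheable tokens with 4-bit weights:  {tokens_4bit}")
print(f"cost at 1000 tok/s: ${cost:.2f} per million tokens")
```

Under these assumptions, smaller weights leave more memory for the KV cache, which allows more concurrent sequences or longer contexts per GPU. Cost per token falls as sustained throughput rises, which is why batching and quantization show up together in cost discussions.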