Episode

Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Podcast
MLOps.community
Published
Feb 19, 2026
Duration seconds
3955
Processing state
processed
Canonical source
https://podcasters.spotify.com/pod/show/mlops/episodes/Serving-LLMs-in-Production-Performance--Cost--Scale--CAST-AI-Roundtable-e3fak28
Audio
https://anchor.fm/s/174cb1b8/podcast/play/115740168/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-19%2F418421040-44100-2-e2d499a23fae4.mp3
JSON
/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable
Markdown
/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

Moving LLMs from prototype to production requires balancing the conflicting demands of latency, throughput, and cost. This roundtable explores how to navigate the spectrum between managed APIs and self-hosted infrastructure using data-driven SLOs.

Topics

  • LLM Inference
  • MLOps
  • GPU Optimization
  • Model Quantization
  • Infrastructure Scaling
  • Cloud Cost Management
  • Service Level Objectives
  • Machine Learning Engineering

Highlights

  • Main idea: Avoid the binary trap of only using APIs or only self-hosting; treat deployment as a spectrum based on specific workload needs
  • Practical takeaway: Design for optionality so you can migrate between infrastructure tiers without rewriting your entire application
  • Failure mode: Overlooking concurrency limits and KV cache requirements, which can lead to unexpected latency spikes and wasted GPU capacity
  • Main idea: Use Service Level Objectives (SLOs) as the foundation for all infrastructure decisions, and define 'goodput' explicitly as part of them
  • Practical takeaway: Use quantization and batching strategically to trade off latency for higher throughput and lower cost per token
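
To make the KV cache and quantization highlights concrete, here is a minimal sizing sketch. It assumes a standard decoder-only transformer in which every layer stores one key and one value vector per KV head for each token. The model shape, the 24 GiB GPU, and the 4096-token context are hypothetical numbers chosen for illustration; none of them come from the episode. Activations, framework overhead, and memory fragmentation are ignored, so real limits will be lower.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder-only model shape (illustrative values only)."""
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # key/value heads per layer
    head_dim: int     # dimension of each head


def weight_bytes(model: ModelShape, bits_per_weight: int) -> float:
    """Memory needed to hold the weights at a given precision."""
    return model.n_params * bits_per_weight / 8


def kv_bytes_per_token(model: ModelShape, bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: a key and a value per layer per KV head."""
    return 2 * model.n_layers * model.n_kv_heads * model.head_dim * bytes_per_elem


def max_concurrent_sequences(
    model: ModelShape,
    gpu_mem_bytes: int,
    bits_per_weight: int,
    context_tokens: int,
    kv_bytes_per_elem: int = 2,
) -> int:
    """Upper bound on full-length sequences whose KV cache fits next to the weights."""
    free = gpu_mem_bytes - weight_bytes(model, bits_per_weight)
    if free <= 0:
        return 0  # the weights alone do not fit
    per_sequence = kv_bytes_per_token(model, kv_bytes_per_elem) * context_tokens
    return int(free // per_sequence)


if __name__ == "__main__":
    # Assumed example: 8B parameters, 32 layers, 8 KV heads of dimension 128.
    model = ModelShape(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)
    for bits in (16, 4):
        n = max_concurrent_sequences(model, 24 * GIB, bits, context_tokens=4096)
        print(f"{bits}-bit weights: up to {n} concurrent 4096-token sequences")
```

Under these assumptions, moving the weights from 16-bit to 4-bit roughly doubles the number of full-length sequences that fit, which is one way quantization can raise achievable batch size and lower cost per token. Whether the accuracy and latency trade-offs are acceptable depends on the workload.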

Chapters

  1. 6:00 The Deployment Spectrum: Moving beyond the binary choice between third-party APIs and full self-hosting to find the right balance for your specific use case.
  2. 16:00 Designing for Movement: Why engineering for optionality prevents costly application rewrites when scaling from prototype to production.
  3. 20:55 Infrastructure Complexity: Understanding how data restrictions, quotas, and GPU capacity impact the underlying infrastructure decisions.
  4. 30:50 Latency vs. Throughput: Analyzing the trade-offs between Time to First Token (TTFT) and end-to-end request latency in generative workloads.
  5. 35:50 Optimizing GPU Utilization: How to balance memory bandwidth, batch sizes, and quantization to minimize cost per token without sacrificing performance.
  6. 40:45 The Impact of Quantization: A deep dive into how storing model weights at lower numeric precision affects memory movement and hardware-level computation efficiency.
  7. 50:40 The Importance of SLOs: Why defining clear Service Level Objectives and 'goodput' is the most critical step in managing LLM workloads.
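
The latency and SLO chapters can be tied together with a small sketch of a goodput calculation. This page does not define the term, so the sketch assumes one common reading: goodput is the rate of completed requests that meet every SLO threshold, here a limit on Time to First Token and a limit on end-to-end latency. The `RequestTiming` and `Slo` types and the threshold values are hypothetical, not taken from the episode.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestTiming:
    """Measured timings for one completed request (hypothetical record type)."""
    ttft_s: float  # time to first token, in seconds
    e2e_s: float   # end-to-end request latency, in seconds


@dataclass(frozen=True)
class Slo:
    """Assumed SLO: upper limits on TTFT and end-to-end latency."""
    max_ttft_s: float
    max_e2e_s: float


def meets_slo(req: RequestTiming, slo: Slo) -> bool:
    """A request counts as good only if it satisfies both limits."""
    return req.ttft_s <= slo.max_ttft_s and req.e2e_s <= slo.max_e2e_s


def throughput(requests: Iterable[RequestTiming], window_s: float) -> float:
    """All completed requests per second over the measurement window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return sum(1 for _ in requests) / window_s


def goodput(requests: Iterable[RequestTiming], slo: Slo, window_s: float) -> float:
    """SLO-compliant requests per second over the measurement window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return sum(1 for r in requests if meets_slo(r, slo)) / window_s


if __name__ == "__main__":
    slo = Slo(max_ttft_s=0.5, max_e2e_s=5.0)
    observed = [
        RequestTiming(ttft_s=0.2, e2e_s=3.0),  # meets both limits
        RequestTiming(ttft_s=0.9, e2e_s=3.5),  # first token arrives too late
        RequestTiming(ttft_s=0.3, e2e_s=7.0),  # total latency too high
        RequestTiming(ttft_s=0.4, e2e_s=4.9),  # meets both limits
    ]
    print("throughput:", throughput(observed, window_s=2.0), "req/s")
    print("goodput:   ", goodput(observed, slo, window_s=2.0), "req/s")
```

Comparing the two numbers shows why raw throughput can mislead: larger batches may raise throughput while pushing more requests past the latency limits, which lowers goodput.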