# Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Page: https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable
Text version: https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md
Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
Published: 2026-02-19T18:00:02+00:00
Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Serving-LLMs-in-Production-Performance--Cost--Scale--CAST-AI-Roundtable-e3fak28
Audio file: https://anchor.fm/s/174cb1b8/podcast/play/115740168/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-19%2F418421040-44100-2-e2d499a23fae4.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable
Duration seconds: 3955

## Resource

Moving LLMs from prototype to production requires balancing the conflicting demands of latency, throughput, and cost. This roundtable explores how to navigate the spectrum between managed APIs and self-hosted infrastructure using data-driven SLOs.

## Highlights
- Main idea: Avoid the binary trap of only using APIs or only self-hosting; treat deployment as a spectrum based on specific workload needs
- Practical takeaway: Design for optionality so you can migrate between infrastructure tiers without rewriting your entire application
- Failure mode: Overlooking concurrency limits and KV cache requirements, which can lead to unexpected latency spikes and wasted GPU capacity
- Main idea: Use Service Level Objectives (SLOs) as the foundation for all infrastructure decisions, and define 'goodput' explicitly as part of them
- Practical takeaway: Use quantization and batching strategically to trade off latency for higher throughput and lower cost per token
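
To make the 'goodput' highlight concrete, here is a minimal sketch that assumes the common reading of the term: the rate of requests that meet every latency SLO, as opposed to raw throughput. The episode summary does not give a definition or thresholds, so `RequestStats`, `goodput`, and the SLO values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestStats:
    """Per-request measurements (hypothetical schema for illustration)."""
    ttft_s: float          # time to first token, in seconds
    e2e_latency_s: float   # end-to-end request latency, in seconds


def goodput(requests: Iterable[RequestStats], window_s: float,
            ttft_slo_s: float, e2e_slo_s: float) -> float:
    """Requests per second that met BOTH SLOs within the measurement window."""
    good = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.e2e_latency_s <= e2e_slo_s
    )
    return good / window_s


# Assumed sample data: four requests observed over a 10-second window.
sample = [
    RequestStats(ttft_s=0.2, e2e_latency_s=2.0),  # meets both SLOs
    RequestStats(ttft_s=0.9, e2e_latency_s=2.5),  # misses the TTFT SLO
    RequestStats(ttft_s=0.3, e2e_latency_s=6.0),  # misses the end-to-end SLO
    RequestStats(ttft_s=0.1, e2e_latency_s=1.0),  # meets both SLOs
]
raw_throughput = len(sample) / 10.0
slo_goodput = goodput(sample, window_s=10.0, ttft_slo_s=0.5, e2e_slo_s=5.0)
```

In this sample the raw throughput is 0.4 requests per second while the goodput is 0.2, which is why a larger batch size can raise throughput on paper without serving more users within their SLOs.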

## Topics

LLM Inference, MLOps, GPU Optimization, Model Quantization, Infrastructure Scaling, Cloud Cost Management, Service Level Objectives, Machine Learning Engineering

## Chapters
- 6:00 — The Deployment Spectrum: Moving beyond the binary choice between third-party APIs and full self-hosting to find the right balance for your specific use case.
- 16:00 — Designing for Movement: Why engineering for optionality prevents costly application rewrites when scaling from prototype to production.
- 20:55 — Infrastructure Complexity: Understanding how data restrictions, quotas, and GPU capacity impact the underlying infrastructure decisions.
- 30:50 — Latency vs. Throughput: Analyzing the trade-offs between Time to First Token (TTFT) and end-to-end request latency in generative workloads.
- 35:50 — Optimizing GPU Utilization: How to balance memory bandwidth, batch sizes, and quantization to minimize cost per token without sacrificing performance.
- 40:45 — The Impact of Quantization: A deep dive into how reducing the numerical precision of model weights affects memory movement and hardware-level computation efficiency.
- 50:40 — The Importance of SLOs: Why defining clear Service Level Objectives and 'goodput' is the most critical step in managing LLM workloads.
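
The GPU utilization and quantization chapters come down to memory arithmetic, which can be sketched as a back-of-envelope estimate. This is a rough illustration, not something taken from the episode: it uses the standard per-token KV cache size for a decoder-only transformer (one key and one value vector per layer), ignores activations, runtime overhead, and fragmentation, and every model dimension below is an assumed, hypothetical value.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Memory needed to hold the model weights at a given precision."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(gpu_bytes: int, num_params: int,
                             bits_per_weight: int, num_layers: int,
                             num_kv_heads: int, head_dim: int,
                             tokens_per_seq: int) -> int:
    """Upper bound on sequences whose KV cache fits next to the weights."""
    free = gpu_bytes - weight_bytes(num_params, bits_per_weight)
    if free <= 0:
        return 0
    per_seq = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim) * tokens_per_seq
    return int(free // per_seq)


# Hypothetical 7B-parameter model on a GPU with 80 GB of memory (assumed numbers).
model = dict(num_params=7_000_000_000, num_layers=32, num_kv_heads=32, head_dim=128)
gpu = 80 * 10**9

fp16_slots = max_concurrent_sequences(gpu, bits_per_weight=16, tokens_per_seq=4096, **model)
int4_slots = max_concurrent_sequences(gpu, bits_per_weight=4, tokens_per_seq=4096, **model)
```

Under these assumptions, 4-bit weights free roughly 10 GB compared with 16-bit weights, which raises the concurrency ceiling from 30 to 35 sequences of 4096 tokens. The KV cache, not the weights, dominates memory at long context lengths, so overlooking it is what leads to the concurrency-related latency spikes mentioned in the highlights.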

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/serving-llms-in-production-performance-cost-scale-cast-ai-roundtable.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.