Episode
How to Engineer AI Inference Systems with Philip Kiely - #766
- Published: Apr 30, 2026
- Duration seconds: 3291
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-engineer-ai-inference-systems-with-philip-kiely-766/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-engineer-ai-inference-systems-with-philip-kiely-766.md
Read the agent-friendly Markdown representation of this episode resource.
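The two actions above can be sketched as plain HTTP requests. This is a minimal illustration only: the helper names `transcription_request` and `markdown_request` are hypothetical, and the sketch assumes the endpoints need no authentication and that the POST accepts an empty body, neither of which this page states. The code only builds the requests; it does not send them.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
PODCAST = "twiml-ai-podcast"
EPISODE = "how-to-engineer-ai-inference-systems-with-philip-kiely-766"


def transcription_request(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build the POST that asks for low-priority transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/{podcast}"
        f"/episodes/{episode}/transcription-requests"
    )
    # Assumption: the endpoint takes no request body, so an empty one is sent.
    return Request(url, data=b"", method="POST")


def markdown_request(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build the GET for the Markdown representation of the episode."""
    url = f"{BASE}/podcast/{podcast}/{episode}.md"
    return Request(url, method="GET")
```

Passing either object to `urllib.request.urlopen` would send it. Because the POST is described as idempotent, repeating it for the same episode should not create duplicate work.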
Summary
Inference engineering is the critical discipline of optimizing AI model deployment for performance, cost, and reliability. This discussion explores the technical levers—from quantization to KV cache reuse—that allow engineers to move from generic APIs to high-performance, specialized runtimes.
Topics
- Inference Engineering
- GPU Optimization
- Large Language Models
- Quantization
- Model Serving
- Distributed Systems
- AI Infrastructure
- Machine Learning Operations
Highlights
- Main idea: Inference engineering is a distinct discipline blending GPU programming, distributed systems, and applied research
- Practical takeaway: Mastering 'the knobs'—batching, quantization, and speculation—is essential for meeting strict product SLAs
- Failure mode: Relying solely on closed APIs can limit your ability to optimize for latency and cost as workloads scale
- Trend: The industry is moving from simple model serving toward dedicated deployments and in-house inference platforms
- Future outlook: Increasing hardware specialization and the rise of agents will require highly optimized, workload-specific runtimes
Chapters
- 1:00 Introduction and Background: A brief introduction to Philip Kiely and his work in AI education and inference engineering.
- 4:55 The Evolution of AI Workloads: Tracing the shift from simple CPU-based classifiers to complex, GPU-accelerated generative models.
- 8:45 The Technical Levers of Inference: Deep dive into the mechanics of inference: quantization, speculation, KV cache reuse, and model parallelization.
- 12:50 Pushing the Envelope in Inference: Discussing the diminishing returns of low-hanging optimization techniques and the search for new frontiers.
- 17:00 Engineering for SLAs and Reliability: How to design products around the realities of token pricing, uptime, and latency constraints.
- 21:20 The Shift to Dedicated Deployments: Analyzing the transition from pay-per-token APIs to managing underlying GPU hardware for better control.
- 25:35 Scaling Inference at the Edge and Enterprise: The challenges of building internal inference platforms and managing distributed edge networks.
- 29:45 GPU Lifecycles and Hardware Economics: The impact of GPU depreciation, the longevity of the Hopper architecture, and the economics of rental markets.