# How to Engineer AI Inference Systems with Philip Kiely - #766

- Page: https://stenobird.com/podcast/twiml-ai-podcast/how-to-engineer-ai-inference-systems-with-philip-kiely-766
- Text version: https://stenobird.com/podcast/twiml-ai-podcast/how-to-engineer-ai-inference-systems-with-philip-kiely-766.md
- Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
- Published: 2026-04-30T20:21:00+00:00
- Episode link: https://twimlai.com/podcast/twimlai/how-engineer-ai-inference-systems
- Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN3829343846.mp3?updated=1777581088
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-engineer-ai-inference-systems-with-philip-kiely-766
- Duration seconds: 3291

## Resource

Inference engineering is the critical discipline of optimizing AI model deployment for performance, cost, and reliability. This discussion explores the technical levers—from quantization to KV cache reuse—that allow engineers to move from generic APIs to high-performance, specialized runtimes.

## Highlights

- Main idea: Inference engineering is a distinct discipline blending GPU programming, distributed systems, and applied research
- Practical takeaway: Mastering 'the knobs'—batching, quantization, and speculation—is essential for meeting strict product SLAs
- Failure mode: Relying solely on closed APIs can limit your ability to optimize for latency and cost as workloads scale
- Trend: The industry is moving from simple model serving toward dedicated deployments and in-house inference platforms
- Future outlook: Increasing hardware specialization and the rise of agents will require highly optimized, workload-specific runtimes

## Topics

Inference Engineering, GPU Optimization, Large Language Models, Quantization, Model Serving, Distributed Systems, AI Infrastructure, Machine Learning Operations

## Chapters

- 1:00 — Introduction and Background: A brief introduction to Philip Kiely and his work in AI education and inference engineering.
- 4:55 — The Evolution of AI Workloads: Tracing the shift from simple CPU-based classifiers to complex, GPU-accelerated generative models.
- 8:45 — The Technical Levers of Inference: Deep dive into the mechanics of inference: quantization, speculation, KV cache reuse, and model parallelization.
- 12:50 — Pushing the Envelope in Inference: Discussing the diminishing returns of low-hanging optimization techniques and the search for new frontiers.
- 17:00 — Engineering for SLAs and Reliability: How to design products around the realities of token pricing, uptime, and latency constraints.
- 21:20 — The Shift to Dedicated Deployments: Analyzing the transition from pay-per-token APIs to managing underlying GPU hardware for better control.
- 25:35 — Scaling Inference at the Edge and Enterprise: The challenges of building internal inference platforms and managing distributed edge networks.
- 29:45 — GPU Lifecycles and Hardware Economics: The impact of GPU depreciation, the longevity of the Hopper architecture, and the economics of rental markets.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-engineer-ai-inference-systems-with-philip-kiely-766/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/how-to-engineer-ai-inference-systems-with-philip-kiely-766.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
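
As a rough illustration of the two actions listed under Actions, the sketch below builds the corresponding HTTP requests with Python's standard library without sending them. The URLs and methods are copied from this page. The helper names, the absence of a request body, and the absence of authentication headers are assumptions, since the page does not document them.

```python
from urllib.request import Request

# Values copied from the URLs shown in the Actions section of this page.
BASE = "https://stenobird.com"
PODCAST = "twiml-ai-podcast"
EPISODE = "how-to-engineer-ai-inference-systems-with-philip-kiely-766"


def build_request_transcript() -> Request:
    """Build (but do not send) the POST for the request_transcript action."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: no request body and no auth header are required.
    return Request(url, method="POST")


def build_read_markdown() -> Request:
    """Build (but do not send) the GET for the read_markdown action."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


if __name__ == "__main__":
    for req in (build_request_transcript(), build_read_markdown()):
        print(req.get_method(), req.full_url)
```

To actually perform either action, the built request could be passed to `urllib.request.urlopen`. The response format is not described on this page, so any parsing of the result would need to be checked against the live service. Because the page describes `request_transcript` as idempotent, repeating the POST should not enqueue duplicate work.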