{"podcast":{"title":"The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)","slug":"twiml-ai-podcast","podcast_index_feed_id":1045879,"rss_url":"https://feeds.megaphone.fm/MLN2155636147","website_url":"https://twimlai.com","image_url":"https://megaphone.imgix.net/podcasts/35230150-ee98-11eb-ad1a-b38cbabcd053/image/TWIML_AI_Podcast_Official_Cover_Art_1400px.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress","author":"TWIML","episode_count":785,"summary":"Machine learning and artificial intelligence are dramatically changing the way businesses operate and people live. The TWIML AI Podcast brings the top minds and ideas from the world of ML and AI to a broad and influential community of ML/AI researchers, data scientists, engineers and tech-savvy business and IT leaders. Hosted by Sam Charrington, a sought-after industry analyst, speaker, commentator and thought leader. Technologies covered include machine learning, artificial intelligence, deep learning, natural language processing, neural networks, analytics, computer science, data science and more.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/twiml-ai-podcast"},"episode":{"title":"How to Engineer AI Inference Systems with Philip Kiely - #766","slug":"how-to-engineer-ai-inference-systems-with-philip-kiely-766","published_at":"2026-04-30T20:21:00+00:00","page_url":"https://stenobird.com/podcast/twiml-ai-podcast/how-to-engineer-ai-inference-systems-with-philip-kiely-766","show_page_url":"https://stenobird.com/podcast/twiml-ai-podcast","url":"https://twimlai.com/podcast/twimlai/how-engineer-ai-inference-systems","audio_url":"https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN3829343846.mp3?updated=1777581088","summary":"Inference engineering is the critical discipline of optimizing AI model deployment for performance, cost, and reliability. 
This discussion explores the technical levers—from quantization to KV cache reuse—that allow engineers to move from generic APIs to high-performance, specialized runtimes.","meta_description":"Explore the frontier of AI inference engineering, covering GPU lifecycles, quantization, vLLM, and the shift toward specialized hardware and runtimes.","key_points":["Main idea: Inference engineering is a distinct discipline blending GPU programming, distributed systems, and applied research","Practical takeaway: Mastering 'the knobs'—batching, quantization, and speculation—is essential for meeting strict product SLAs","Failure mode: Relying solely on closed APIs can limit your ability to optimize for latency and cost as workloads scale","Trend: The industry is moving from simple model serving toward dedicated deployments and in-house inference platforms","Future outlook: Increasing hardware specialization and the rise of agents will require highly optimized, workload-specific runtimes"],"chapters":[{"start_ms":60000,"title":"Introduction and Background","summary":"A brief introduction to Philip Kiely and his work in AI education and inference engineering."},{"start_ms":295000,"title":"The Evolution of AI Workloads","summary":"Tracing the shift from simple CPU-based classifiers to complex, GPU-accelerated generative models."},{"start_ms":525000,"title":"The Technical Levers of Inference","summary":"Deep dive into the mechanics of inference: quantization, speculation, KV cache reuse, and model parallelization."},{"start_ms":770000,"title":"Pushing the Envelope in Inference","summary":"Discussing the diminishing returns of low-hanging optimization techniques and the search for new frontiers."},{"start_ms":1020000,"title":"Engineering for SLAs and Reliability","summary":"How to design products around the realities of token pricing, uptime, and latency constraints."},{"start_ms":1280000,"title":"The Shift to Dedicated Deployments","summary":"Analyzing the transition from 
pay-per-token APIs to managing underlying GPU hardware for better control."},{"start_ms":1535000,"title":"Scaling Inference at the Edge and Enterprise","summary":"The challenges of building internal inference platforms and managing distributed edge networks."},{"start_ms":1785000,"title":"GPU Lifecycles and Hardware Economics","summary":"The impact of GPU depreciation, the longevity of the Hopper architecture, and the economics of rental markets."}],"topics":["Inference Engineering","GPU Optimization","Large Language Models","Quantization","Model Serving","Distributed Systems","AI Infrastructure","Machine Learning Operations"],"duration_seconds":3291,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/how-to-engineer-ai-inference-systems-with-philip-kiely-766/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/twiml-ai-podcast/how-to-engineer-ai-inference-systems-with-philip-kiely-766.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}