{"podcast":{"title":"The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)","slug":"twiml-ai-podcast","podcast_index_feed_id":1045879,"rss_url":"https://feeds.megaphone.fm/MLN2155636147","website_url":"https://twimlai.com","image_url":"https://megaphone.imgix.net/podcasts/35230150-ee98-11eb-ad1a-b38cbabcd053/image/TWIML_AI_Podcast_Official_Cover_Art_1400px.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress","author":"TWIML","episode_count":785,"summary":"Machine learning and artificial intelligence are dramatically changing the way businesses operate and people live. The TWIML AI Podcast brings the top minds and ideas from the world of ML and AI to a broad and influential community of ML/AI researchers, data scientists, engineers and tech-savvy business and IT leaders. Hosted by Sam Charrington, a sought after industry analyst, speaker, commentator and thought leader. Technologies covered include machine learning, artificial intelligence, deep learning, natural language processing, neural networks, analytics, computer science, data science and more.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/twiml-ai-podcast"},"episode":{"title":"Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757","slug":"scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757","published_at":"2025-12-02T22:29:00+00:00","page_url":"https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757","show_page_url":"https://stenobird.com/podcast/twiml-ai-podcast","url":"https://twimlai.com/podcast/twimlai/scaling-agentic-inference-across-heterogeneous-compute/","audio_url":"https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN2686987005.mp3?updated=1764715926","summary":"The current reliance on high-end GPUs for all AI workloads is unsustainable for agentic systems that consume massive token volumes. Gimlet Labs proposes a heterogeneous inference architecture that disaggregates workloads across a mix of H100s, older GPUs, and CPUs to optimize unit economics.","meta_description":"Explore how Gimlet Labs uses heterogeneous compute and LLM-driven kernel synthesis to scale efficient, cost-effective agentic AI inference.","key_points":["Main idea: Agentic AI requires a shift from vertically integrated supercomputers to large-scale, commodity-hardware-based inference","Practical takeaway: Using a 'three-layer cake' architecture—workload disaggregation, compilation, and kernel synthesis—can significantly improve efficiency","Failure mode: Relying solely on high-end GPUs for agentic workloads leads to unsustainable costs due to high token consumption","Technical challenge: Maintaining numerical precision and verifying correctness when using LLMs to rewrite compute kernels across different hardware","Future trend: The rise of sovereign clouds and specialized data centers creates a massive opportunity for hardware-aware, heterogeneous scheduling"],"chapters":[{"start_ms":60000,"title":"The Shift to Heterogeneous Inference","summary":"Introduction to the unsustainable nature of running agentic workloads exclusively on high-end GPUs and the need for efficiency."},{"start_ms":270000,"title":"Optimizing Efficiency and Latency","summary":"Discussing how understanding workload distribution allows for better optimization of both system latency and model performance."},{"start_ms":485000,"title":"Orchestration via Kubernetes","summary":"An overview of using Kubernetes and DRA to manage and orchestrate diverse hardware resources."},{"start_ms":1135000,"title":"LLM-Driven Kernel Synthesis","summary":"Deep dive into the 'three-layer cake' architecture, specifically using MLIR, Torch, and LLMs to automate compute kernel optimization."},{"start_ms":1350000,"title":"The Challenge of Numerical Precision","summary":"Exploring the difficulties of floating-point math errors and the impact of quantization on kernel verification."},{"start_ms":1980000,"title":"Inference vs. Training Architectures","summary":"Contrasting the vertically integrated 'supercomputer' approach of training with the scalable, commodity-based approach needed for inference."},{"start_ms":2410000,"title":"The Future of Agentic Clouds","summary":"Discussing the launch of developer-facing agent clouds and the importance of orchestrating asynchronous, large-scale workloads."}],"topics":["Agentic AI","Heterogeneous Computing","LLM Inference","Kernel Synthesis","GPU Optimization","Machine Learning Engineering","Cloud Infrastructure","Compute Economics"],"duration_seconds":2924,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}