# Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

Page: https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757
Text version: https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md
Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
Published: 2025-12-02T22:29:00+00:00
Episode link: https://twimlai.com/podcast/twimlai/scaling-agentic-inference-across-heterogeneous-compute/
Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN2686987005.mp3?updated=1764715926
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757
Duration seconds: 2924

## Resource

The current reliance on high-end GPUs for all AI workloads is unsustainable for agentic systems that consume massive token volumes. Gimlet Labs proposes a heterogeneous inference architecture that disaggregates workloads across a mix of H100s, older GPUs, and CPUs to optimize unit economics.
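To make the unit-economics argument concrete, here is a minimal sketch of cost-aware placement: each disaggregated stage goes to the cheapest device tier that still meets its latency budget. This is not Gimlet Labs' scheduler. The `Device`, `Stage`, and `place` names are hypothetical, and every price and throughput figure is a made-up placeholder chosen only to illustrate the trade-off.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Device:
    name: str
    cost_per_hour: float      # HYPOTHETICAL price, for illustration only
    tokens_per_second: float  # HYPOTHETICAL throughput, for illustration only


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: int         # tokens the stage must process
    max_seconds: float  # latency budget for the stage


# Placeholder tiers mirroring the mix named in the text; the numbers are invented.
DEVICES = [
    Device("h100", cost_per_hour=4.0, tokens_per_second=2000.0),
    Device("older_gpu", cost_per_hour=1.0, tokens_per_second=600.0),
    Device("cpu", cost_per_hour=0.05, tokens_per_second=50.0),
]


def run_cost(stage: Stage, device: Device) -> tuple[float, float]:
    """Return (seconds, cost) for running the whole stage on one device."""
    seconds = stage.tokens / device.tokens_per_second
    return seconds, seconds / 3600.0 * device.cost_per_hour


def place(stage: Stage, devices: Sequence[Device]) -> Optional[Device]:
    """Pick the cheapest device that meets the latency budget, or None if none does."""
    best: Optional[tuple[Device, float]] = None
    for device in devices:
        seconds, cost = run_cost(stage, device)
        if seconds > stage.max_seconds:
            continue  # too slow for this stage
        if best is None or cost < best[1]:
            best = (device, cost)
    return best[0] if best else None
```

Under these placeholder numbers, a tight interactive stage lands on the fastest tier, while a background or batch stage with a loose budget moves to cheaper hardware. That is the intuition behind not running every token on a high-end GPU.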
## Highlights

- Main idea: Agentic AI requires a shift from vertically integrated supercomputers to large-scale, commodity-hardware-based inference
- Practical takeaway: Using a 'three-layer cake' architecture—workload disaggregation, compilation, and kernel synthesis—can significantly improve efficiency
- Failure mode: Relying solely on high-end GPUs for agentic workloads leads to unsustainable costs due to high token consumption
- Technical challenge: Maintaining numerical precision and verifying correctness when using LLMs to rewrite compute kernels across different hardware
- Future trend: The rise of sovereign clouds and specialized data centers creates a massive opportunity for hardware-aware, heterogeneous scheduling

## Topics

Agentic AI, Heterogeneous Computing, LLM Inference, Kernel Synthesis, GPU Optimization, Machine Learning Engineering, Cloud Infrastructure, Compute Economics

## Chapters

- 1:00 — The Shift to Heterogeneous Inference: Introduction to the unsustainable nature of running agentic workloads exclusively on high-end GPUs and the need for efficiency.
- 4:30 — Optimizing Efficiency and Latency: Discussing how understanding workload distribution allows for better optimization of both system latency and model performance.
- 8:05 — Orchestration via Kubernetes: An overview of using Kubernetes and DRA to manage and orchestrate diverse hardware resources.
- 18:55 — LLM-Driven Kernel Synthesis: Deep dive into the 'three-layer cake' architecture, specifically using MLIR, Torch, and LLMs to automate compute kernel optimization.
- 22:30 — The Challenge of Numerical Precision: Exploring the difficulties of floating-point math errors and the impact of quantization on kernel verification.
- 33:00 — Inference vs. Training Architectures: Contrasting the vertically integrated 'supercomputer' approach of training with the scalable, commodity-based approach needed for inference.
- 40:10 — The Future of Agentic Clouds: Discussing the launch of developer-facing agent clouds and the importance of orchestrating asynchronous, large-scale workloads.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
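
For agents that want to call the two endpoints listed under Actions, the sketch below builds the requests with Python's standard library. The helper names (`build_read_markdown_request`, `build_transcript_request`, `send`) are hypothetical. The sketch also assumes that the endpoints need no authentication and that the `POST` accepts an empty body; the page states neither, and the response format is not documented here.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST = "twiml-ai-podcast"
EPISODE = (
    "scaling-agentic-inference-across-heterogeneous-compute"
    "-with-zain-asgar-757"
)


def build_read_markdown_request() -> urllib.request.Request:
    """GET the Markdown representation; per the page, this does not enqueue transcription."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def build_transcript_request() -> urllib.request.Request:
    """POST an explicit transcript request (described as idempotent on the page).

    ASSUMPTION: no request body or auth header is required.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return urllib.request.Request(url, method="POST")


def send(req: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Perform the request and return (HTTP status, response text)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Nothing touches the network until `send(...)` is called, so the request objects can be inspected or logged first. Because the page describes `request_transcript` as idempotent, retrying `send(build_transcript_request())` after a transient failure should be safe.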