Episode

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

Podcast: The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Published: Dec 2, 2025
Duration seconds: 2924
Processing state: processed
Canonical source: https://twimlai.com/podcast/twimlai/scaling-agentic-inference-across-heterogeneous-compute/
Audio: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN2686987005.mp3?updated=1764715926
JSON: /v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757
Markdown: /podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md

Actions

POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

The current reliance on high-end GPUs for all AI workloads is unsustainable for agentic systems that consume massive token volumes. Gimlet Labs proposes a heterogeneous inference architecture that disaggregates workloads across a mix of H100s, older GPUs, and CPUs to optimize unit economics.
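
The unit-economics argument in the summary can be made concrete with a small sketch. It assumes a hypothetical fleet with made-up prices and throughputs (none of these figures come from the episode), and the names `Device`, `cost_per_million_tokens`, and `cheapest_device` are illustrative only. It shows one simple way to choose hardware per workload: pick the lowest cost per token among devices that still meet a throughput floor. This is not Gimlet Labs' actual scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    cost_per_hour: float      # hypothetical price in USD
    tokens_per_second: float  # hypothetical throughput


def cost_per_million_tokens(device: Device) -> float:
    """Unit cost of generating one million tokens on this device."""
    seconds = 1_000_000 / device.tokens_per_second
    return device.cost_per_hour * seconds / 3600


def cheapest_device(devices, min_tokens_per_second):
    """Pick the lowest unit-cost device that still meets a throughput floor."""
    eligible = [d for d in devices if d.tokens_per_second >= min_tokens_per_second]
    if not eligible:
        raise ValueError("no device meets the throughput floor")
    return min(eligible, key=cost_per_million_tokens)


# Illustrative, made-up numbers; not measurements from the episode.
FLEET = [
    Device("h100", cost_per_hour=4.00, tokens_per_second=2000.0),
    Device("older-gpu", cost_per_hour=1.00, tokens_per_second=800.0),
    Device("cpu", cost_per_hour=0.20, tokens_per_second=50.0),
]

# A latency-tolerant background step versus an interactive step.
background = cheapest_device(FLEET, min_tokens_per_second=100)
interactive = cheapest_device(FLEET, min_tokens_per_second=1000)
print(background.name, interactive.name)
```

Under these assumed numbers the latency-tolerant step lands on the older GPU while the interactive step still needs the H100, which is the kind of split a heterogeneous fleet is meant to exploit.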

Topics

Agentic AI
Heterogeneous Computing
LLM Inference
Kernel Synthesis
GPU Optimization
Machine Learning Engineering
Cloud Infrastructure
Compute Economics

Highlights

Main idea: Agentic AI requires a shift from vertically integrated supercomputers to large-scale, commodity-hardware-based inference
Practical takeaway: A 'three-layer cake' architecture (workload disaggregation, compilation, and kernel synthesis) can significantly improve efficiency
Failure mode: Relying solely on high-end GPUs for agentic workloads leads to unsustainable costs due to high token consumption
Technical challenge: Maintaining numerical precision and verifying correctness when using LLMs to rewrite compute kernels across different hardware
Future trend: The rise of sovereign clouds and specialized data centers creates a massive opportunity for hardware-aware, heterogeneous scheduling
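
The verification challenge named in the highlights can be illustrated with a minimal sketch. It assumes a hypothetical setup in which a rewritten kernel (`candidate_dot`, a stand-in that accumulates in simulated float32) is checked against a high-precision reference (`reference_dot`) on random inputs. All function names and tolerances here are assumptions for illustration and do not describe the tooling discussed in the episode.

```python
import math
import random
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reference_dot(a, b):
    """High-precision reference: accurately rounded float64 summation."""
    return math.fsum(x * y for x, y in zip(a, b))


def candidate_dot(a, b):
    """Stand-in for a rewritten kernel: accumulates in simulated float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_float32(acc + to_float32(x * y))
    return acc


def matches_reference(candidate, reference, inputs, rel_tol, abs_tol):
    """Return True if the candidate agrees with the reference on every input."""
    return all(
        math.isclose(candidate(a, b), reference(a, b),
                     rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in inputs
    )


# Fixed seed so the check is reproducible.
rng = random.Random(0)
INPUTS = [
    ([rng.uniform(-1, 1) for _ in range(256)],
     [rng.uniform(-1, 1) for _ in range(256)])
    for _ in range(20)
]

loose = matches_reference(candidate_dot, reference_dot, INPUTS,
                          rel_tol=1e-3, abs_tol=1e-3)
strict = matches_reference(candidate_dot, reference_dot, INPUTS,
                           rel_tol=0.0, abs_tol=1e-15)
print(loose, strict)
```

The same candidate can pass under a loose tolerance and fail under a strict one, so "correct" depends on a tolerance chosen for the precision in use. That is part of what makes verifying machine-generated kernels difficult.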

Chapters

1:00 The Shift to Heterogeneous Inference: Why running agentic workloads exclusively on high-end GPUs is unsustainable, and why efficiency matters.
4:30 Optimizing Efficiency and Latency: How understanding workload distribution enables better optimization of both system latency and model performance.
8:05 Orchestration via Kubernetes: An overview of using Kubernetes and Dynamic Resource Allocation (DRA) to manage and orchestrate diverse hardware resources.
18:55 LLM-Driven Kernel Synthesis: Deep dive into the 'three-layer cake' architecture, specifically using MLIR, Torch, and LLMs to automate compute kernel optimization.
22:30 The Challenge of Numerical Precision: Exploring the difficulties of floating-point math errors and the impact of quantization on kernel verification.
33:00 Inference vs. Training Architectures: Contrasting the vertically integrated 'supercomputer' approach of training with the scalable, commodity-based approach needed for inference.
40:10 The Future of Agentic Clouds: Discussing the launch of developer-facing agent clouds and the importance of orchestrating asynchronous, large-scale workloads.