Episode
Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
- Published: Dec 2, 2025
- Duration (seconds): 2924
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
The current reliance on high-end GPUs for all AI workloads is unsustainable for agentic systems that consume massive token volumes. Gimlet Labs proposes a heterogeneous inference architecture that disaggregates workloads across a mix of H100s, older GPUs, and CPUs to optimize unit economics.
Topics
- Agentic AI
- Heterogeneous Computing
- LLM Inference
- Kernel Synthesis
- GPU Optimization
- Machine Learning Engineering
- Cloud Infrastructure
- Compute Economics
Highlights
- Main idea: Agentic AI requires a shift from vertically integrated supercomputers to large-scale, commodity-hardware-based inference
- Practical takeaway: Using a 'three-layer cake' architecture—workload disaggregation, compilation, and kernel synthesis—can significantly improve efficiency
- Failure mode: Relying solely on high-end GPUs for agentic workloads leads to unsustainable costs due to high token consumption
- Technical challenge: Maintaining numerical precision and verifying correctness when using LLMs to rewrite compute kernels across different hardware
- Future trend: The rise of sovereign clouds and specialized data centers creates a massive opportunity for hardware-aware, heterogeneous scheduling
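The "Technical challenge" highlight above (verifying a rewritten kernel despite floating-point differences) can be illustrated with a minimal sketch. It is not the method described in the episode: the function names `reference_dot`, `candidate_dot`, and `matches`, the tolerances, and the float32 simulation are all assumptions chosen for illustration. The idea is that a rewritten kernel is compared against a trusted reference within a tolerance rather than bit for bit, because a different accumulation precision or order changes the low-order bits.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reference_dot(a, b):
    """Trusted reference: float64 products summed with math.fsum."""
    return math.fsum(x * y for x, y in zip(a, b))


def candidate_dot(a, b):
    """Hypothetical rewritten kernel: multiplies and accumulates in simulated float32."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_f32(to_f32(x) * to_f32(y))
        acc = to_f32(acc + product)
    return acc


def matches(expected: float, actual: float, rel_tol: float = 1e-4, abs_tol: float = 1e-6) -> bool:
    """Tolerance-based check; the tolerances here are illustrative assumptions."""
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)


# Each product a[i] * b[i] is close to 0.1, so the dot product is close to 10.
a = [0.1 * i for i in range(1, 101)]
b = [1.0 / i for i in range(1, 101)]

ref = reference_dot(a, b)
cand = candidate_dot(a, b)

within_tol = matches(ref, cand)            # lower-precision kernel should pass
broken_passes = matches(ref, ref * 1.01)   # a 1% error should be rejected

print(f"reference={ref!r} candidate={cand!r}")
print(f"within tolerance: {within_tol}, 1% error accepted: {broken_passes}")
```

Choosing the tolerance is the hard part: too tight and correct low-precision kernels are rejected, too loose and real bugs slip through. Quantized kernels generally need looser, workload-specific bounds.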
Chapters
- 1:00 The Shift to Heterogeneous Inference: Introduction to the unsustainable nature of running agentic workloads exclusively on high-end GPUs and the need for efficiency.
- 4:30 Optimizing Efficiency and Latency: Discussing how understanding workload distribution allows for better optimization of both system latency and model performance.
- 8:05 Orchestration via Kubernetes: An overview of using Kubernetes and DRA to manage and orchestrate diverse hardware resources.
- 18:55 LLM-Driven Kernel Synthesis: Deep dive into the 'three-layer cake' architecture, specifically using MLIR, Torch, and LLMs to automate compute kernel optimization.
- 22:30 The Challenge of Numerical Precision: Exploring the difficulties of floating-point math errors and the impact of quantization on kernel verification.
- 33:00 Inference vs. Training Architectures: Contrasting the vertically integrated 'supercomputer' approach of training with the scalable, commodity-based approach needed for inference.
- 40:10 The Future of Agentic Clouds: Discussing the launch of developer-facing agent clouds and the importance of orchestrating asynchronous, large-scale workloads.