# Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

Page: https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757
Text version: https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md
Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
Published: 2025-12-02T22:29:00+00:00
Episode link: https://twimlai.com/podcast/twimlai/scaling-agentic-inference-across-heterogeneous-compute/
Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN2686987005.mp3?updated=1764715926
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757
Duration seconds: 2924

## Resource

The current reliance on high-end GPUs for all AI workloads is unsustainable for agentic systems that consume massive token volumes. Gimlet Labs proposes a heterogeneous inference architecture that disaggregates workloads across a mix of H100s, older GPUs, and CPUs to optimize unit economics.

## Highlights
- Main idea: Agentic AI requires a shift from vertically integrated supercomputers to large-scale, commodity-hardware-based inference
- Practical takeaway: A 'three-layer cake' architecture (workload disaggregation, compilation, and kernel synthesis) can significantly improve efficiency
- Failure mode: Relying solely on high-end GPUs for agentic workloads leads to unsustainable costs due to high token consumption
- Technical challenge: Maintaining numerical precision and verifying correctness when using LLMs to rewrite compute kernels across different hardware
- Future trend: The rise of sovereign clouds and specialized data centers creates a massive opportunity for hardware-aware, heterogeneous scheduling

## Topics

Agentic AI, Heterogeneous Computing, LLM Inference, Kernel Synthesis, GPU Optimization, Machine Learning Engineering, Cloud Infrastructure, Compute Economics

## Chapters
- 1:00 — The Shift to Heterogeneous Inference: Introduction to the unsustainable nature of running agentic workloads exclusively on high-end GPUs and the need for efficiency.
- 4:30 — Optimizing Efficiency and Latency: Discussing how understanding workload distribution allows for better optimization of both system latency and model performance.
- 8:05 — Orchestration via Kubernetes: An overview of using Kubernetes and Dynamic Resource Allocation (DRA) to manage and orchestrate diverse hardware resources.
- 18:55 — LLM-Driven Kernel Synthesis: Deep dive into the 'three-layer cake' architecture, specifically using MLIR, Torch, and LLMs to automate compute kernel optimization.
- 22:30 — The Challenge of Numerical Precision: Exploring the difficulties of floating-point math errors and the impact of quantization on kernel verification.
- 33:00 — Inference vs. Training Architectures: Contrasting the vertically integrated 'supercomputer' approach of training with the scalable, commodity-based approach needed for inference.
- 40:10 — The Future of Agentic Clouds: Discussing the launch of developer-facing agent clouds and the importance of orchestrating asynchronous, large-scale workloads.
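
The numerical precision challenge named in the 22:30 chapter can be illustrated with a minimal sketch. It is not taken from the episode and does not represent Gimlet Labs' verification method. It only shows the general point that floating-point addition is not associative, so a rewritten kernel that reorders a reduction can differ from the reference in the last bits and is usually compared within a tolerance rather than for exact equality. The function names and tolerance values are hypothetical.

```python
import math
from typing import Iterable


def sequential_sum(values: Iterable[float]) -> float:
    """Plain left-to-right accumulation, standing in for a reference kernel."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def matches_reference(reference: float, candidate: float,
                      rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Tolerance-based check; the tolerances here are illustrative assumptions."""
    return math.isclose(reference, candidate, rel_tol=rel_tol, abs_tol=abs_tol)


values = [0.1, 0.2, 0.3]
reference = sequential_sum(values)            # accumulates 0.1, then 0.2, then 0.3
candidate = sequential_sum(reversed(values))  # same data, different reduction order

print(reference == candidate)                  # exact comparison fails
print(matches_reference(reference, candidate)) # tolerance comparison passes
```

Choosing the tolerance is the hard part in practice: lower-precision or quantized formats accumulate larger errors, so a bound that suits 64-bit floats may reject correct low-precision kernels.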

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
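
As a minimal sketch, the two actions above could be invoked from Python's standard library as follows. The URLs and HTTP methods come from the action list; everything else is an assumption, including the empty POST body, the absence of authentication, the timeout, and the helper names, since this page does not document request or response formats.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST_SLUG = "twiml-ai-podcast"
EPISODE_SLUG = (
    "scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757"
)


def build_request_transcript() -> urllib.request.Request:
    """Build the POST for the request_transcript action (empty body is an assumption)."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST_SLUG}"
        f"/episodes/{EPISODE_SLUG}/transcription-requests"
    )
    return urllib.request.Request(url, data=b"", method="POST")


def build_read_markdown() -> urllib.request.Request:
    """Build the GET for the read_markdown action."""
    url = f"{BASE}/podcast/{PODCAST_SLUG}/{EPISODE_SLUG}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


# Example usage (performs network calls, so it is left commented out):
# status, body = send(build_request_transcript())
# status, markdown = send(build_read_markdown())
```

Because the action is described as idempotent, repeating the POST for the same episode should be safe, though the response shape is not specified here.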

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.