Episode

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

Podcast
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Published
Dec 2, 2025
Duration seconds
2924
Processing state
processed
Canonical source
https://twimlai.com/podcast/twimlai/scaling-agentic-inference-across-heterogeneous-compute/
Audio
https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN2686987005.mp3?updated=1764715926
JSON
/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757
Markdown
/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/twiml-ai-podcast/scaling-agentic-inference-across-heterogeneous-compute-with-zain-asgar-757.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

The current reliance on high-end GPUs for all AI workloads is unsustainable for agentic systems that consume massive token volumes. Gimlet Labs proposes a heterogeneous inference architecture that disaggregates workloads across a mix of H100s, older GPUs, and CPUs to optimize unit economics.
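The unit-economics argument above can be illustrated with a small sketch. It is not Gimlet Labs' scheduler: the `Device` and `Stage` types, the `cheapest_feasible` helper, and every price and throughput figure below are hypothetical placeholders chosen only to show the shape of the trade-off. Each stage of an agentic workload goes to the cheapest device that still meets that stage's latency budget.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    cost_per_hour: float      # placeholder price, not a real quote
    tokens_per_second: float  # placeholder throughput, not a benchmark


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: int
    deadline_seconds: float   # latency budget for this stage


def cost_and_latency(device: Device, stage: Stage) -> Tuple[float, float]:
    """Return (cost, seconds) for running one stage on one device."""
    seconds = stage.tokens / device.tokens_per_second
    cost = device.cost_per_hour * seconds / 3600.0
    return cost, seconds


def cheapest_feasible(devices: List[Device], stage: Stage) -> Optional[Device]:
    """Pick the lowest-cost device that finishes within the stage deadline."""
    best: Optional[Tuple[Device, float]] = None
    for device in devices:
        cost, seconds = cost_and_latency(device, stage)
        if seconds > stage.deadline_seconds:
            continue  # too slow for this stage's latency budget
        if best is None or cost < best[1]:
            best = (device, cost)
    return best[0] if best else None


# Hypothetical fleet: all numbers are illustrative assumptions.
DEVICES = [
    Device("high-end-gpu", cost_per_hour=4.0, tokens_per_second=2000.0),
    Device("older-gpu", cost_per_hour=1.0, tokens_per_second=800.0),
    Device("cpu", cost_per_hour=0.1, tokens_per_second=100.0),
]

if __name__ == "__main__":
    stages = [
        Stage("interactive-reply", tokens=1000, deadline_seconds=1.0),
        Stage("tool-call-planning", tokens=1000, deadline_seconds=2.0),
        Stage("background-summary", tokens=1000, deadline_seconds=60.0),
    ]
    for stage in stages:
        chosen = cheapest_feasible(DEVICES, stage)
        print(stage.name, "->", chosen.name if chosen else "no feasible device")
```

Under these made-up numbers, only the latency-critical stage needs the most expensive device. Stages with looser deadlines fall through to cheaper hardware, which is the intuition behind mixing device classes.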

Topics

  • Agentic AI
  • Heterogeneous Computing
  • LLM Inference
  • Kernel Synthesis
  • GPU Optimization
  • Machine Learning Engineering
  • Cloud Infrastructure
  • Compute Economics

Highlights

  • Main idea: Agentic AI requires a shift from vertically integrated supercomputers to large-scale inference on commodity hardware
  • Practical takeaway: A 'three-layer cake' architecture (workload disaggregation, compilation, and kernel synthesis) can significantly improve efficiency
  • Failure mode: Relying solely on high-end GPUs for agentic workloads leads to unsustainable costs due to high token consumption
  • Technical challenge: Maintaining numerical precision and verifying correctness when using LLMs to rewrite compute kernels across different hardware
  • Future trend: The rise of sovereign clouds and specialized data centers creates a massive opportunity for hardware-aware, heterogeneous scheduling
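The numerical-precision challenge in the highlights can be made concrete with a minimal sketch. The episode page does not describe how verification is actually done, so this is one common approach used as an assumption: run a trusted reference kernel and a rewritten candidate on the same inputs, then compare the outputs within a tolerance instead of requiring bit-exact equality. The function names are hypothetical, and the float32 rounding stands in for a lower-precision target device.

```python
import math
import struct
from typing import Callable, List, Sequence

Kernel = Callable[[Sequence[float]], List[float]]


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reference_softmax(xs: Sequence[float]) -> List[float]:
    """Reference kernel: numerically stable softmax computed in float64."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = math.fsum(exps)
    return [e / total for e in exps]


def candidate_softmax(xs: Sequence[float]) -> List[float]:
    """Hypothetical rewritten kernel: same math, with intermediate results
    rounded to float32 to mimic a lower-precision target."""
    m = max(xs)
    exps = [to_float32(math.exp(to_float32(x - m))) for x in xs]
    total = 0.0
    for e in exps:
        total = to_float32(total + e)
    return [to_float32(e / total) for e in exps]


def kernels_agree(
    reference: Kernel,
    candidate: Kernel,
    inputs: Sequence[float],
    rel_tol: float = 1e-5,
    abs_tol: float = 1e-7,
) -> bool:
    """Return True when every candidate output is within tolerance of the
    reference output. The tolerances are illustrative, not recommendations."""
    expected = reference(inputs)
    actual = candidate(inputs)
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


if __name__ == "__main__":
    sample = [0.5, -1.25, 3.0, 2.0]
    print("candidate agrees:", kernels_agree(reference_softmax, candidate_softmax, sample))
```

The hard part is choosing the tolerance. If it is too tight, a correct lower-precision kernel is rejected. If it is too loose, a genuinely wrong rewrite can pass, which is why quantization makes this kind of check harder.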

Chapters

  1. 1:00 The Shift to Heterogeneous Inference: Introduction to the unsustainable nature of running agentic workloads exclusively on high-end GPUs and the need for efficiency.
  2. 4:30 Optimizing Efficiency and Latency: Discussing how understanding workload distribution allows for better optimization of both system latency and model performance.
  3. 8:05 Orchestration via Kubernetes: An overview of using Kubernetes and Dynamic Resource Allocation (DRA) to manage and orchestrate diverse hardware resources.
  4. 18:55 LLM-Driven Kernel Synthesis: Deep dive into the 'three-layer cake' architecture, specifically using MLIR, Torch, and LLMs to automate compute kernel optimization.
  5. 22:30 The Challenge of Numerical Precision: Exploring the difficulties of floating-point math errors and the impact of quantization on kernel verification.
  6. 33:00 Inference vs. Training Architectures: Contrasting the vertically integrated 'supercomputer' approach of training with the scalable, commodity-based approach needed for inference.
  7. 40:10 The Future of Agentic Clouds: Discussing the launch of developer-facing agent clouds and the importance of orchestrating asynchronous, large-scale workloads.