Episode

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Podcast
Latent Space: The AI Engineer Podcast
Published
Mar 10, 2026
Duration seconds
5017
Processing state
processed
Canonical source
https://www.latent.space/p/nvidia-brev-dynamo
Audio
https://api.substack.com/feed/podcast/190477229/547dbb1e74e5465c243c3af0599b15cd.mp3
JSON
/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo
Markdown
/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

NVIDIA engineers discuss the architectural shift from single-node inference to planetary-scale distributed systems using NVIDIA Dynamo. The conversation covers the trade-offs of disaggregating prefill and decode, Kubernetes-based orchestration, and the security implications of autonomous agents.
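
As a rough illustration of why prefill and decode stress hardware differently, the sketch below estimates arithmetic intensity (FLOPs per byte of weights read) for one dense layer. It is a simplified model, not something taken from the episode: it ignores activation traffic, KV cache reads, and decode batching, and the function name and the 2-byte weight size are assumptions chosen for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one dense layer pass (simplified).

    Applying a d x d weight matrix to `tokens_per_pass` tokens costs about
    2 * tokens * d * d FLOPs and reads d * d * bytes_per_weight bytes of
    weights, so d cancels out of the ratio.
    """
    if tokens_per_pass < 1 or bytes_per_weight < 1:
        raise ValueError("tokens_per_pass and bytes_per_weight must be positive")
    return 2 * tokens_per_pass / bytes_per_weight


# Prefill processes the whole prompt in one pass (assumed 2048 tokens here).
prefill = arithmetic_intensity(tokens_per_pass=2048)
# Decode produces one new token per pass for a single unbatched sequence.
decode = arithmetic_intensity(tokens_per_pass=1)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Under these assumptions prefill does far more arithmetic per byte of weights loaded than decode does, which is the usual reasoning behind calling prefill compute-bound and decode memory-bound.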

Topics

  • AI Inference
  • NVIDIA Dynamo
  • Distributed Systems
  • Kubernetes
  • LLM Optimization
  • Agent Security
  • GPU Orchestration
  • Machine Learning Infrastructure

Highlights

  • Main idea: Scaling inference effectively requires moving beyond single-node limits to a 'scale-out' architecture that manages varying compute and memory demands
  • Technical takeaway: Implementing prefill/decode disaggregation allows for specialized hardware utilization, separating compute-bound prefill from memory-bound decode phases
  • Failure mode: Granting agents simultaneous access to file systems, custom code execution, and the internet creates critical security vulnerabilities
  • Practical takeaway: Use a Kubernetes-based orchestrator such as Grove to adjust the ratio of prefill to decode resources dynamically as the workload shifts
  • Core tension: The fundamental engineering challenge in modern inference is balancing the 'trilemma' of cost, latency, and output quality
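
The practical takeaway above, adjusting the ratio of prefill to decode resources as the workload shifts, can be sketched as a proportional split. This is a hypothetical illustration only: `split_workers` and its load inputs are invented names, and nothing here reflects the actual API or behavior of Grove or Dynamo.

```python
def split_workers(total: int, prefill_load: float, decode_load: float) -> tuple[int, int]:
    """Divide `total` workers between prefill and decode pools.

    Loads are any comparable demand estimate (for example, queued work per
    phase). Each pool always keeps at least one worker so neither phase stalls.
    """
    if total < 2:
        raise ValueError("need at least one worker per pool")
    if prefill_load < 0 or decode_load < 0:
        raise ValueError("loads must be non-negative")
    demand = prefill_load + decode_load
    # With no measured demand, fall back to an even split.
    share = 0.5 if demand == 0 else prefill_load / demand
    prefill = min(max(round(total * share), 1), total - 1)
    return prefill, total - prefill


# Long-prompt traffic shifts workers toward prefill...
print(split_workers(8, prefill_load=3.0, decode_load=1.0))
# ...while long-generation traffic shifts them toward decode.
print(split_workers(8, prefill_load=1.0, decode_load=3.0))
```

A real orchestrator would also need to account for worker startup time and in-flight requests before resizing a pool; the sketch covers only the ratio calculation.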

Chapters

  1. 1:00 NVIDIA GTC and Developer Experience: Reflections on the transition from startup life at Brev to NVIDIA and the importance of developer-centric marketing.
  2. 7:25 The Acquisition Journey: A look at the experience of being acquired and the shift toward providing one-click deployment experiences for developers.
  3. 13:45 The 'Speed of Light' (SOL) Philosophy: Discussing Jensen Huang's principle of operating with extreme urgency and its impact on engineering culture.
  4. 26:35 The Evolution of Inference: How inference has moved from a niche topic to a primary focus of large-scale AI infrastructure and industry discussion.
  5. 33:15 The Inference Trilemma: Analyzing the critical trade-offs between cost, latency, and model quality in production environments.
  6. 39:25 Disaggregated Prefill and Decode: Deep dive into the technical benefits of separating the prefill and decode stages so that each can be optimized for its own bottleneck, compute for prefill and memory for decode.
  7. 58:15 Agent Security and Permissions: Strategies for securing autonomous agents by limiting their access to interconnected capabilities like code execution and the internet.
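
One way to read the agent-security point above is as a rule that no single agent holds file system access, custom code execution, and internet access at the same time. The sketch below encodes that reading. It is an assumption made for illustration, not the policy described in the episode, and the capability names are hypothetical.

```python
# Hypothetical capability labels; a real system would define its own.
RISKY_TRIO = frozenset({"filesystem", "code_execution", "internet"})


def is_allowed(requested: set[str]) -> bool:
    """Reject a grant only when it includes all three risky capabilities."""
    return not RISKY_TRIO <= set(requested)


# Any two of the three pass; requesting all three together is refused.
print(is_allowed({"filesystem", "internet"}))
print(is_allowed({"filesystem", "code_execution", "internet"}))
```

Checking the combination rather than each capability alone matches the failure mode listed in the highlights, where the risk comes from granting the three capabilities simultaneously.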