Episode

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Podcast: Latent Space: The AI Engineer Podcast
Published: Mar 10, 2026
Duration seconds: 5017
Processing state: processed
Canonical source: https://www.latent.space/p/nvidia-brev-dynamo
Audio: https://api.substack.com/feed/podcast/190477229/547dbb1e74e5465c243c3af0599b15cd.mp3
JSON: /v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo
Markdown: /podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md

Actions

POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

NVIDIA engineers discuss the architectural shift from single-node inference to planetary-scale distributed systems using NVIDIA Dynamo. The conversation explores the technical trade-offs between prefill and decode disaggregation, Kubernetes-based orchestration, and the security implications of autonomous agents.

Topics

AI Inference
NVIDIA Dynamo
Distributed Systems
Kubernetes
LLM Optimization
Agent Security
GPU Orchestration
Machine Learning Infrastructure

Highlights

Main idea: Scaling inference effectively requires moving beyond single-node limits to a 'scale-out' architecture that manages varying compute and memory demands
Technical takeaway: Prefill/decode disaggregation lets each phase run on hardware suited to it, separating the compute-bound prefill phase from the memory-bound decode phase
Failure mode: Granting agents simultaneous access to file systems, custom code execution, and the internet creates critical security vulnerabilities
Practical takeaway: Use a Kubernetes-based orchestrator such as Grove to adjust the ratio of prefill to decode resources dynamically as the workload shifts
Core tension: The fundamental engineering challenge in modern inference is balancing the 'trilemma' of cost, latency, and output quality
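
To make the prefill-to-decode ratio idea concrete, here is a minimal Python sketch of how a fixed pool of workers might be split between the two phases in proportion to demand. It is an illustration only: it does not use the Dynamo or Grove APIs, and the names `Workload` and `split_workers`, along with the throughput parameters, are hypothetical assumptions rather than anything described in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Hypothetical measurements; not Dynamo or Grove metrics.
    requests_per_sec: float
    avg_input_tokens: float   # tokens processed during prefill
    avg_output_tokens: float  # tokens generated during decode


def split_workers(total_workers: int, load: Workload,
                  prefill_tokens_per_sec: float,
                  decode_tokens_per_sec: float) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) in proportion to demand."""
    if total_workers < 2:
        raise ValueError("need at least one worker per phase")
    # Worker-seconds of work each phase receives per second of traffic.
    prefill_demand = (load.requests_per_sec * load.avg_input_tokens
                      / prefill_tokens_per_sec)
    decode_demand = (load.requests_per_sec * load.avg_output_tokens
                     / decode_tokens_per_sec)
    total = prefill_demand + decode_demand
    if total == 0:
        prefill = total_workers // 2  # no traffic: split evenly
    else:
        prefill = round(total_workers * prefill_demand / total)
    # Keep at least one worker in each pool.
    prefill = max(1, min(total_workers - 1, prefill))
    return prefill, total_workers - prefill
```

When prompts get longer while outputs stay the same length, the prefill share grows, which is the kind of workload shift an orchestrator would need to react to. A real system would also account for KV-cache transfer cost and latency targets, which this sketch ignores.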

Chapters

1:00 NVIDIA GTC and Developer Experience: Reflections on the transition from startup life at Brev to NVIDIA and the importance of developer-centric marketing.
7:25 The Acquisition Journey: A look at the experience of being acquired and the shift toward providing one-click deployment experiences for developers.
13:45 The 'Speed of Light' (SOL) Philosophy: Discussing Jensen Huang's principle of operating with extreme urgency and its impact on engineering culture.
26:35 The Evolution of Inference: How inference has moved from a niche topic to a primary focus of large-scale AI infrastructure and industry discussion.
33:15 The Inference Trilemma: Analyzing the critical trade-offs between cost, latency, and model quality in production environments.
39:25 Disaggregated Prefill and Decode: A deep dive into the technical benefits of separating the prefill and decode stages so that each can be optimized for its own bottleneck, compute for prefill and memory for decode.
58:15 Agent Security and Permissions: Strategies for securing autonomous agents by limiting their access to interconnected capabilities like code execution and the internet.
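
The agent security point can be sketched as a simple permission check: an agent may hold some capabilities, but never file system access, code execution, and internet access all at once. The Python below is a minimal illustration under that assumption. The `Capability` flag and `is_allowed` function are hypothetical names and do not come from the episode or from any NVIDIA tooling.

```python
from enum import Flag, auto


class Capability(Flag):
    # Hypothetical capability set for an autonomous agent.
    FILE_SYSTEM = auto()
    CODE_EXECUTION = auto()
    INTERNET = auto()


# The combination the episode flags as a failure mode.
RISKY_COMBINATION = (Capability.FILE_SYSTEM
                     | Capability.CODE_EXECUTION
                     | Capability.INTERNET)


def is_allowed(granted: Capability) -> bool:
    """Reject any grant that includes all three capabilities at once."""
    return (granted & RISKY_COMBINATION) != RISKY_COMBINATION
```

For example, `is_allowed(Capability.FILE_SYSTEM | Capability.CODE_EXECUTION)` passes, while granting all three fails. A production policy would likely be more granular (per-path, per-domain, per-tool), but the check captures the core rule of keeping the three capabilities from being combined.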