# NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Page: https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo
Text version: https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md
Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
Published: 2026-03-10T06:40:22+00:00
Episode link: https://www.latent.space/p/nvidia-brev-dynamo
Audio file: https://api.substack.com/feed/podcast/190477229/547dbb1e74e5465c243c3af0599b15cd.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo
Duration seconds: 5017

## Resource

NVIDIA engineers discuss the architectural shift from single-node inference to planetary-scale distributed systems using NVIDIA Dynamo. The conversation explores the technical trade-offs of disaggregating prefill and decode, Kubernetes-based orchestration, and the security implications of autonomous agents.
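As a rough illustration of the prefill/decode disaggregation mentioned above, the sketch below divides a fixed pool of workers between the two phases in proportion to estimated load. Everything here is an assumption made for illustration: `WorkerSplit`, `split_workers`, and the unitless load figures are hypothetical and are not part of NVIDIA Dynamo, Grove, or Kubernetes, and the episode does not describe this exact policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSplit:
    """Hypothetical result type: how many workers serve each phase."""
    prefill: int
    decode: int


def split_workers(total_workers: int, prefill_load: float, decode_load: float) -> WorkerSplit:
    """Divide a fixed pool between prefill and decode in proportion to load.

    The loads are unitless estimates of pending work per phase (an assumption
    for this sketch), not metrics exposed by any real serving system.
    """
    if total_workers < 2:
        raise ValueError("need at least one worker per phase")
    if prefill_load < 0 or decode_load < 0:
        raise ValueError("loads must be non-negative")

    total_load = prefill_load + decode_load
    if total_load == 0:
        # No signal yet: fall back to an even split.
        prefill = total_workers // 2
    else:
        prefill = round(total_workers * prefill_load / total_load)

    # Keep at least one worker on each side so neither phase stalls.
    prefill = max(1, min(total_workers - 1, prefill))
    return WorkerSplit(prefill=prefill, decode=total_workers - prefill)


if __name__ == "__main__":
    # Prompt-heavy traffic shifts workers toward prefill.
    print(split_workers(8, prefill_load=3.0, decode_load=1.0))
    # Generation-heavy traffic shifts workers toward decode.
    print(split_workers(8, prefill_load=1.0, decode_load=3.0))
```

A real orchestrator would re-run a decision like this as traffic changes and would also have to account for the cost of moving workers between roles, which this sketch ignores.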
## Highlights

- Main idea: Scaling inference effectively requires moving beyond single-node limits to a 'scale-out' architecture that manages varying compute and memory demands
- Technical takeaway: Implementing prefill/decode disaggregation allows for specialized hardware utilization, separating compute-bound prefill from memory-bound decode phases
- Failure mode: Granting agents simultaneous access to file systems, custom code execution, and the internet creates critical security vulnerabilities
- Practical takeaway: Use Kubernetes-based orchestration like Grove to dynamically adjust the ratio of prefill to decode resources based on workload shifts
- Core tension: The fundamental engineering challenge in modern inference is balancing the 'trilemma' of cost, latency, and output quality

## Topics

AI Inference, NVIDIA Dynamo, Distributed Systems, Kubernetes, LLM Optimization, Agent Security, GPU Orchestration, Machine Learning Infrastructure

## Chapters

- 1:00 — NVIDIA GTC and Developer Experience: Reflections on the transition from startup life at Brev to NVIDIA and the importance of developer-centric marketing.
- 7:25 — The Acquisition Journey: A look at the experience of being acquired and the shift toward providing one-click deployment experiences for developers.
- 13:45 — The 'Speed of Light' (SOL) Philosophy: Discussing Jensen Huang's principle of operating with extreme urgency and its impact on engineering culture.
- 26:35 — The Evolution of Inference: How inference has moved from a niche topic to a primary focus of large-scale AI infrastructure and industry discussion.
- 33:15 — The Inference Trilemma: Analyzing the critical trade-offs between cost, latency, and model quality in production environments.
- 39:25 — Disaggregated Prefill and Decode: Deep dive into the technical benefits of separating prefill and decode stages to optimize for compute and memory bounds.
- 58:15 — Agent Security and Permissions: Strategies for securing autonomous agents by limiting their access to interconnected capabilities like code execution and the internet.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.