# NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Page: https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo
Text version: https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md
Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
Published: 2026-03-10T06:40:22+00:00
Episode link: https://www.latent.space/p/nvidia-brev-dynamo
Audio file: https://api.substack.com/feed/podcast/190477229/547dbb1e74e5465c243c3af0599b15cd.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo
Duration seconds: 5017

## Resource

NVIDIA engineers discuss the architectural shift from single-node inference to planetary-scale distributed systems using NVIDIA Dynamo. The conversation explores the trade-offs of disaggregating prefill and decode, Kubernetes-based orchestration, and the security implications of autonomous agents.
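
As a rough illustration of why prefill and decode stress hardware differently, the sketch below estimates arithmetic intensity (FLOPs per byte of weights read) for one dense layer. It is a back-of-the-envelope model, not something taken from the episode or from Dynamo: the function names and the hardware figures are hypothetical, and activation and KV-cache traffic are ignored.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read for one dense layer pass.

    Applying a d x d weight matrix to `tokens_per_pass` tokens costs about
    2 * tokens * d * d FLOPs and reads d * d * bytes_per_param bytes of
    weights, so d cancels out. Simplifying assumption: activation and
    KV-cache traffic are ignored.
    """
    if tokens_per_pass < 1:
        raise ValueError("tokens_per_pass must be at least 1")
    return 2.0 * tokens_per_pass / bytes_per_param


def bound_by(intensity: float, peak_flops: float, peak_bytes_per_s: float) -> str:
    """Classify a workload against the roofline ridge point of a device."""
    ridge = peak_flops / peak_bytes_per_s
    return "compute" if intensity >= ridge else "memory"


# Hypothetical accelerator: 1e15 FLOP/s and 3e12 bytes/s (ridge near 333 FLOPs/byte).
PEAK_FLOPS = 1e15
PEAK_BYTES_PER_S = 3e12

prefill = arithmetic_intensity(tokens_per_pass=4096)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one new token per pass

print("prefill:", bound_by(prefill, PEAK_FLOPS, PEAK_BYTES_PER_S))
print("decode:", bound_by(decode, PEAK_FLOPS, PEAK_BYTES_PER_S))
```

Under these assumptions a long prompt lands on the compute-bound side of the ridge and single-token decode on the memory-bound side, which is the asymmetry that disaggregation exploits.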

## Highlights
- Main idea: Scaling inference effectively requires moving beyond single-node limits to a 'scale-out' architecture that manages varying compute and memory demands
- Technical takeaway: Disaggregating prefill and decode lets each phase run on hardware suited to it, since prefill is compute-bound and decode is memory-bound
- Failure mode: Granting agents simultaneous access to file systems, custom code execution, and the internet creates critical security vulnerabilities
- Practical takeaway: Use Kubernetes-based orchestration like Grove to dynamically adjust the ratio of prefill to decode resources based on workload shifts
- Core tension: The fundamental engineering challenge in modern inference is balancing the 'trilemma' of cost, latency, and output quality
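
The practical takeaway about adjusting the prefill-to-decode ratio can be sketched as simple capacity arithmetic. This is only an illustration under stated assumptions: it does not use Grove or Dynamo APIs, and `Workload`, `size_pools`, and the per-worker throughput numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests_per_s: float
    avg_prompt_tokens: float   # tokens processed during prefill
    avg_output_tokens: float   # tokens generated during decode


def size_pools(
    workload: Workload,
    prefill_tokens_per_s_per_worker: float,
    decode_tokens_per_s_per_worker: float,
) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) needed to keep up with demand.

    Assumes throughput scales linearly with worker count and keeps at least
    one worker in each pool.
    """
    prefill_demand = workload.requests_per_s * workload.avg_prompt_tokens
    decode_demand = workload.requests_per_s * workload.avg_output_tokens
    prefill_workers = max(1, math.ceil(prefill_demand / prefill_tokens_per_s_per_worker))
    decode_workers = max(1, math.ceil(decode_demand / decode_tokens_per_s_per_worker))
    return prefill_workers, decode_workers


# Hypothetical per-worker throughputs, chosen only to make the arithmetic visible.
PREFILL_RATE = 20_000.0
DECODE_RATE = 2_000.0

chat = Workload(requests_per_s=10, avg_prompt_tokens=500, avg_output_tokens=400)
long_context = Workload(requests_per_s=10, avg_prompt_tokens=8000, avg_output_tokens=200)

print(size_pools(chat, PREFILL_RATE, DECODE_RATE))          # (1, 2)
print(size_pools(long_context, PREFILL_RATE, DECODE_RATE))  # (4, 1)
```

With the same request rate, the short-prompt chat mix favors decode workers while the long-context mix favors prefill workers, which is why a fixed ratio tends to waste capacity when the workload shifts.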

## Topics

AI Inference, NVIDIA Dynamo, Distributed Systems, Kubernetes, LLM Optimization, Agent Security, GPU Orchestration, Machine Learning Infrastructure

## Chapters
- 1:00 — NVIDIA GTC and Developer Experience: Reflections on the transition from startup life at Brev to NVIDIA and the importance of developer-centric marketing.
- 7:25 — The Acquisition Journey: A look at the experience of being acquired and the shift toward providing one-click deployment experiences for developers.
- 13:45 — The 'Speed of Light' (SOL) Philosophy: Discussing Jensen Huang's principle of operating with extreme urgency and its impact on engineering culture.
- 26:35 — The Evolution of Inference: How inference has moved from a niche topic to a primary focus of large-scale AI infrastructure and industry discussion.
- 33:15 — The Inference Trilemma: Analyzing the critical trade-offs between cost, latency, and model quality in production environments.
- 39:25 — Disaggregated Prefill and Decode: Deep dive into the technical benefits of separating prefill and decode stages to optimize for compute and memory bounds.
- 58:15 — Agent Security and Permissions: Strategies for securing autonomous agents by limiting their access to interconnected capabilities like code execution and the internet.
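
One way to picture the permission-limiting idea from the agent security chapter is a grant check that refuses the combination of file system access, code execution, and internet access. This is a minimal sketch, not the approach described by the speakers; `Capability` and `validate_grant` are hypothetical names.

```python
from enum import Flag


class Capability(Flag):
    FILE_SYSTEM = 1
    CODE_EXECUTION = 2
    INTERNET = 4


# The combination the episode summary flags as a critical vulnerability.
RISKY_COMBINATION = (
    Capability.FILE_SYSTEM | Capability.CODE_EXECUTION | Capability.INTERNET
)


def validate_grant(requested: Capability) -> Capability:
    """Return the requested capabilities, or raise if all three are combined."""
    if (requested & RISKY_COMBINATION) == RISKY_COMBINATION:
        raise PermissionError(
            "refusing to grant file system, code execution, and internet together"
        )
    return requested


# Any two of the three capabilities pass the check.
granted = validate_grant(Capability.FILE_SYSTEM | Capability.CODE_EXECUTION)
print(granted)

try:
    validate_grant(RISKY_COMBINATION)
except PermissionError as exc:
    print("denied:", exc)
```

A real policy would likely be finer-grained (scoped paths, allow-listed hosts, sandboxed execution); the sketch only encodes the "not all three at once" rule.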

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
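
A minimal sketch of invoking `request_transcript` from Python's standard library follows. The endpoint URL and the POST method come from the Actions list above; the empty request body, the absence of authentication headers, and the helper names are assumptions, since this page does not document them.

```python
import urllib.request

# Episode API URL as listed under "JSON" on this page.
EPISODE_API = (
    "https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/"
    "nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-"
    "nader-khalil-brev-kyle-kranen-dynamo"
)


def build_transcript_request(episode_api: str = EPISODE_API) -> urllib.request.Request:
    """Build the POST request without sending it.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    return urllib.request.Request(
        f"{episode_api}/transcription-requests",
        data=b"",
        method="POST",
    )


def request_transcript(timeout_s: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_transcript_request(), timeout=timeout_s) as resp:
        return resp.status


if __name__ == "__main__":
    print(request_transcript())
```

Because the action is described as idempotent, retrying the call after a timeout should not enqueue duplicate work, though the exact response codes are not specified here.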

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.