Episode
NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)
- Published: Mar 10, 2026
- Duration seconds: 5017
- Processing state: processed
- Canonical source: https://www.latent.space/p/nvidia-brev-dynamo
Actions
POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md
Read the agent-friendly Markdown representation of this episode resource.
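As a rough illustration of the two actions above, the following sketch builds the corresponding HTTP requests with Python's standard library. The URLs and methods come from this page; everything else is an assumption: no request body, headers, or authentication are described here, so none are set, and the response format is not known. The helper names `transcription_request` and `markdown_request` are hypothetical.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = (
    "nvidia-s-ai-engineers-agent-inference-at-planetary-scale-"
    "and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo"
)


def transcription_request() -> Request:
    """Build (but do not send) the POST that requests transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: the endpoint needs no body; the page does not document one.
    return Request(url, method="POST")


def markdown_request() -> Request:
    """Build (but do not send) the GET for the Markdown representation."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")
```

Either request object can be passed to `urllib.request.urlopen` to send it. Because the POST is described as idempotent, repeating it should be safe, though that behavior is taken from the description above rather than verified.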
Summary
NVIDIA engineers discuss the architectural shift from single-node inference to planetary-scale distributed systems using NVIDIA Dynamo. The conversation explores the technical trade-offs of disaggregating prefill and decode, Kubernetes-based orchestration, and the security implications of autonomous agents.
Topics
- AI Inference
- NVIDIA Dynamo
- Distributed Systems
- Kubernetes
- LLM Optimization
- Agent Security
- GPU Orchestration
- Machine Learning Infrastructure
Highlights
- Main idea: Scaling inference effectively requires moving beyond single-node limits to a 'scale-out' architecture that manages varying compute and memory demands
- Technical takeaway: Implementing prefill/decode disaggregation allows for specialized hardware utilization, separating compute-bound prefill from memory-bound decode phases
- Failure mode: Granting agents simultaneous access to file systems, custom code execution, and the internet creates critical security vulnerabilities
- Practical takeaway: Use Kubernetes-based orchestration like Grove to dynamically adjust the ratio of prefill to decode resources based on workload shifts
- Core tension: The fundamental engineering challenge in modern inference is balancing the 'trilemma' of cost, latency, and output quality
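The idea of adjusting the prefill-to-decode ratio as the workload shifts can be sketched with a toy calculation. This is not the Dynamo or Grove API; it is a minimal sketch that assumes each phase has a known per-GPU throughput and splits a fixed GPU pool in proportion to the GPU time each phase demands. The names `Workload` and `split_pool`, and all throughput figures, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    prompt_tokens_per_s: float  # prompt tokens arriving for prefill
    output_tokens_per_s: float  # tokens to be generated during decode


def split_pool(
    total_gpus: int,
    load: Workload,
    prefill_tps_per_gpu: float,
    decode_tps_per_gpu: float,
) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) proportional to each phase's demand."""
    if total_gpus < 2:
        raise ValueError("need at least one GPU per phase")
    # GPUs' worth of work each phase needs to keep up with the incoming load.
    prefill_demand = load.prompt_tokens_per_s / prefill_tps_per_gpu
    decode_demand = load.output_tokens_per_s / decode_tps_per_gpu
    total_demand = prefill_demand + decode_demand
    if total_demand == 0:
        prefill = total_gpus // 2  # idle system: split evenly
    else:
        prefill = round(total_gpus * prefill_demand / total_demand)
    # Keep at least one GPU in each phase.
    prefill = min(max(prefill, 1), total_gpus - 1)
    return prefill, total_gpus - prefill


# Long-prompt workload (e.g. document summarization): prefill dominates.
summarize = Workload(prompt_tokens_per_s=80_000, output_tokens_per_s=2_000)
# Short-prompt, long-answer workload: decode dominates.
chat = Workload(prompt_tokens_per_s=20_000, output_tokens_per_s=6_000)

print(split_pool(16, summarize, prefill_tps_per_gpu=20_000, decode_tps_per_gpu=1_000))
print(split_pool(16, chat, prefill_tps_per_gpu=20_000, decode_tps_per_gpu=1_000))
```

With these made-up numbers the same 16-GPU pool tilts toward prefill for the long-prompt workload and toward decode for the chat workload, which is the kind of shift an orchestrator would react to. A real system would also have to weigh latency targets and KV-cache transfer cost, which this sketch ignores.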
Chapters
- 1:00 NVIDIA GTC and Developer Experience: Reflections on the transition from startup life at Brev to NVIDIA and the importance of developer-centric marketing.
- 7:25 The Acquisition Journey: A look at the experience of being acquired and the shift toward providing one-click deployment experiences for developers.
- 13:45 The 'Speed of Light' (SOL) Philosophy: Discussing Jensen Huang's principle of operating with extreme urgency and its impact on engineering culture.
- 26:35 The Evolution of Inference: How inference has moved from a niche topic to a primary focus of large-scale AI infrastructure and industry discussion.
- 33:15 The Inference Trilemma: Analyzing the critical trade-offs between cost, latency, and model quality in production environments.
- 39:25 Disaggregated Prefill and Decode: Deep dive into the technical benefits of separating prefill and decode stages to optimize for compute and memory bounds.
- 58:15 Agent Security and Permissions: Strategies for securing autonomous agents by limiting their access to interconnected capabilities like code execution and the internet.
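The agent-security point, that file system access, custom code execution, and internet access become dangerous when granted together, can be expressed as a simple permission check. The sketch below is one possible reading of that rule (allow any subset except all three at once), not a policy taken from the episode; `Capability`, `RISKY_COMBINATION`, and `is_allowed` are hypothetical names.

```python
from enum import Flag, auto


class Capability(Flag):
    FILE_SYSTEM = auto()
    CODE_EXECUTION = auto()
    INTERNET = auto()


# The combination flagged as a failure mode: all three at the same time.
RISKY_COMBINATION = (
    Capability.FILE_SYSTEM | Capability.CODE_EXECUTION | Capability.INTERNET
)


def is_allowed(requested: Capability) -> bool:
    """Reject a grant only if it contains every capability in the risky set."""
    return (requested & RISKY_COMBINATION) != RISKY_COMBINATION


# A sandboxed coding agent: files and code execution, but no network.
print(is_allowed(Capability.FILE_SYSTEM | Capability.CODE_EXECUTION))  # True
# Everything at once is refused.
print(is_allowed(RISKY_COMBINATION))  # False
```

A check like this only guards the grant itself. Enforcing it still requires the runtime to actually isolate the file system, executor, and network from one another.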