{"podcast":{"title":"Latent Space: The AI Engineer Podcast","slug":"latent-space-ai-engineer","podcast_index_feed_id":6058902,"rss_url":"https://api.substack.com/feed/podcast/1084089.rss","website_url":"https://www.latent.space/podcast","image_url":"https://substackcdn.com/feed/podcast/1084089/ca7468da5614a246d2906ee8926f6de7.jpg","author":"Latent.Space","episode_count":214,"summary":"The AI Engineer newsletter + Top technical AI podcast. How leading labs build Agents, Models, Infra, & AI for Science. See https://latent.space/about for highlights from Greg Brockman, Andrej Karpathy, George Hotz, Simon Willison, Soumith Chintala et al!","last_synced_at":"2026-07-17T00:20:53.505905+00:00","page_url":"https://stenobird.com/podcast/latent-space-ai-engineer"},"episode":{"title":"NVIDIA's AI Engineers: Agent Inference at Planetary Scale and \"Speed of Light\" — Nader Khalil (Brev), Kyle Kranen (Dynamo)","slug":"nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo","published_at":"2026-03-10T06:40:22+00:00","page_url":"https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo","show_page_url":"https://stenobird.com/podcast/latent-space-ai-engineer","url":"https://www.latent.space/p/nvidia-brev-dynamo","audio_url":"https://api.substack.com/feed/podcast/190477229/547dbb1e74e5465c243c3af0599b15cd.mp3","summary":"NVIDIA engineers discuss the architectural shift from single-node inference to planetary-scale distributed systems using NVIDIA Dynamo. The conversation explores the technical trade-offs between prefill and decode disaggregation, Kubernetes-based orchestration, and the security implications of autonomous agents.","meta_description":"Explore NVIDIA's approach to large-scale AI inference, including Dynamo's architecture, prefill/decode disaggregation, and agent security.","key_points":["Main idea: Scaling inference effectively requires moving beyond single-node limits to a 'scale-out' architecture that manages varying compute and memory demands","Technical takeaway: Implementing prefill/decode disaggregation allows for specialized hardware utilization, separating compute-bound prefill from memory-bound decode phases","Failure mode: Granting agents simultaneous access to file systems, custom code execution, and the internet creates critical security vulnerabilities","Practical takeaway: Use Kubernetes-based orchestration like Grove to dynamically adjust the ratio of prefill to decode resources based on workload shifts","Core tension: The fundamental engineering challenge in modern inference is balancing the 'trilemma' of cost, latency, and output quality"],"chapters":[{"start_ms":60000,"title":"NVIDIA GTC and Developer Experience","summary":"Reflections on the transition from startup life at Brev to NVIDIA and the importance of developer-centric marketing."},{"start_ms":445000,"title":"The Acquisition Journey","summary":"A look at the experience of being acquired and the shift toward providing one-click deployment experiences for developers."},{"start_ms":825000,"title":"The 'Speed of Light' (SOL) Philosophy","summary":"Discussing Jensen Huang's principle of operating with extreme urgency and its impact on engineering culture."},{"start_ms":1595000,"title":"The Evolution of Inference","summary":"How inference has moved from a niche topic to a primary focus of large-scale AI infrastructure and industry discussion."},{"start_ms":1995000,"title":"The Inference Trilemma","summary":"Analyzing the critical trade-offs between cost, latency, and model quality in production environments."},{"start_ms":2365000,"title":"Disaggregated Prefill and Decode","summary":"Deep dive into the technical benefits of separating prefill and decode stages to optimize for compute and memory bounds."},{"start_ms":3495000,"title":"Agent Security and Permissions","summary":"Strategies for securing autonomous agents by limiting their access to interconnected capabilities like code execution and the internet."}],"topics":["AI Inference","NVIDIA Dynamo","Distributed Systems","Kubernetes","LLM Optimization","Agent Security","GPU Orchestration","Machine Learning Infrastructure"],"duration_seconds":5017,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/latent-space-ai-engineer/nvidia-s-ai-engineers-agent-inference-at-planetary-scale-and-speed-of-light-nader-khalil-brev-kyle-kranen-dynamo.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}