{"podcast":{"title":"The Stack Overflow Podcast","slug":"the-stack-overflow-podcast","podcast_index_feed_id":450923,"rss_url":"https://rss.art19.com/the-stack-overflow-podcast","website_url":"https://art19.com/shows/the-stack-overflow-podcast","image_url":"https://content.production.cdn.art19.com/images/f1/4b/a2/43/f14ba243-6fa1-48bc-88bb-16b5e90e01cf/9ab8462ecb3182c5303998dc1a19385c2c816946f95a9fa658457e657e3ea170cac950b4c623a4447028d0e31bb3b3e2ec62ad0b4d3fe42f5bc0419c6d811c9d.jpeg","author":"The Stack Overflow Podcast","episode_count":939,"summary":"For well over a decade, the Stack Overflow Podcast has been exploring what it means to be a developer and how the art and practice of software engineering is changing our world. From creating code to running it in production, we host important conversations and fascinating guests that will help you understand how technology is made and where it’s headed. Hosted by Ryan Donovan, the Stack Overflow Podcast is your home for all things software.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/the-stack-overflow-podcast"},"episode":{"title":"Even the chip makers are making LLMs","slug":"even-the-chip-makers-are-making-llms","published_at":"2026-03-10T04:00:00+00:00","page_url":"https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms","show_page_url":"https://stenobird.com/podcast/the-stack-overflow-podcast","url":"https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0","audio_url":"https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0","summary":"NVIDIA is moving beyond chip manufacturing to develop its own large language models through a hardware-software co-design loop. This strategy uses real-world model workloads to optimize next-generation GPU architectures and memory management.","meta_description":"Explore how NVIDIA uses extreme co-design between hardware and software to build efficient LLMs like Nemotron, optimizing for memory and token efficiency.","key_points":["Main idea: NVIDIA employs a 'full stack' approach, using model development to inform hardware architecture and networking","Practical takeaway: Hybrid architectures, such as combining Transformers with Mamba state space models, can significantly improve token efficiency","Failure mode: Relying solely on dense Transformers can lead to unsustainable inference costs as context length increases","Main idea: The Nemotron family represents a push toward open-source models with transparent training data and recipes","Practical takeaway: Effective generative AI requires optimizing the entire system, including memory hierarchies and disaggregated serving"],"chapters":[{"start_ms":60000,"title":"The Full Stack Vision","summary":"NVIDIA's transition from a chipmaker to a full-stack company involved in model development."},{"start_ms":175000,"title":"Hardware-Software Co-design","summary":"How understanding application workloads like deep learning drives the evolution of CUDA and GPU architecture."},{"start_ms":290000,"title":"Optimizing for Performance","summary":"Sharing architectural recipes for Blackwell and Hopper to improve memory, performance, and scalability."},{"start_ms":400000,"title":"Extreme Co-design","summary":"Matching model requirements to hardware constraints, specifically regarding memory and form factor."},{"start_ms":515000,"title":"Disaggregated Serving","summary":"Innovations in inference, including splitting pre-fill and decode tasks across different GPUs."},{"start_ms":630000,"title":"Hybrid Model Architectures","summary":"The development of Nemotron using a combination of Mamba state space models and Transformers."},{"start_ms":750000,"title":"The Future of Architectures","summary":"Exploring the potential of diffusion models and purpose-built innovations for token efficiency."},{"start_ms":865000,"title":"Systems of Models","summary":"Moving from single models to complex, interconnected systems of models and specialized memory management."}],"topics":["NVIDIA","Large Language Models","Generative AI","GPU Architecture","Nemotron","Transformer Models","State Space Models","Hardware-Software Co-design","Machine Learning Inference"],"duration_seconds":1613,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}