# Even the chip makers are making LLMs

- Page: https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms
- Text version: https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md
- Podcast: [The Stack Overflow Podcast](https://stenobird.com/podcast/the-stack-overflow-podcast)
- Published: 2026-03-10T04:00:00+00:00
- Episode link: https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
- Audio file: https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms
- Duration seconds: 1613

## Resource

NVIDIA is moving beyond chip manufacturing to develop its own large language models through a hardware-software co-design loop. This strategy uses real-world model workloads to optimize next-generation GPU architectures and memory management.

## Highlights

- Main idea: NVIDIA employs a 'full stack' approach, using model development to inform hardware architecture and networking
- Practical takeaway: Hybrid architectures, such as combining Transformers with Mamba state space models, can significantly improve token efficiency
- Failure mode: Relying solely on dense Transformers can lead to unsustainable inference costs as context length increases
- Main idea: The Nemotron family represents a push toward open-source models with transparent training data and recipes
- Practical takeaway: Effective generative AI requires optimizing the entire system, including memory hierarchies and disaggregated serving

## Topics

NVIDIA, Large Language Models, Generative AI, GPU Architecture, Nemotron, Transformer Models, State Space Models, Hardware-Software Co-design, Machine Learning Inference

## Chapters

- 1:00 - The Full Stack Vision: NVIDIA's transition from a chipmaker to a full-stack company involved in model development.
- 2:55 - Hardware-Software Co-design: How understanding application workloads like deep learning drives the evolution of CUDA and GPU architecture.
- 4:50 - Optimizing for Performance: Sharing architectural recipes for Blackwell and Hopper to improve memory, performance, and scalability.
- 6:40 - Extreme Co-design: Matching model requirements to hardware constraints, specifically regarding memory and form factor.
- 8:35 - Disaggregated Serving: Innovations in inference, including splitting prefill and decode tasks across different GPUs.
- 10:30 - Hybrid Model Architectures: The development of Nemotron using a combination of Mamba state space models and Transformers.
- 12:30 - The Future of Architectures: Exploring the potential of diffusion models and purpose-built innovations for token efficiency.
- 14:25 - Systems of Models: Moving from single models to complex, interconnected systems of models and specialized memory management.

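The failure mode noted in the highlights (dense Transformer inference cost rising with context length) can be illustrated with rough key-value cache arithmetic. The sketch below is not taken from the episode: the layer counts, head sizes, and state sizes are hypothetical and do not describe Nemotron or any real model. It relies only on two general properties: an attention layer caches one key and one value vector per token, and a state space layer keeps a fixed-size state regardless of context length.

```python
GIB = 1024 ** 3


def kv_cache_bytes(context_len, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Per-sequence KV cache size: one key and one value vector per token, per attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * context_len


def ssm_state_bytes(ssm_layers, state_elems_per_layer, bytes_per_value=2):
    """Per-sequence recurrent state size for state space layers (independent of context length)."""
    return ssm_layers * state_elems_per_layer * bytes_per_value


# Hypothetical configurations, chosen only to make the arithmetic easy to follow.
DENSE = dict(attn_layers=32, kv_heads=8, head_dim=128)
HYBRID_ATTN = dict(attn_layers=4, kv_heads=8, head_dim=128)
HYBRID_SSM = dict(ssm_layers=28, state_elems_per_layer=131_072)

for context_len in (8_192, 32_768, 131_072):
    dense = kv_cache_bytes(context_len, **DENSE)
    hybrid = kv_cache_bytes(context_len, **HYBRID_ATTN) + ssm_state_bytes(**HYBRID_SSM)
    print(f"{context_len:>7} tokens: dense {dense / GIB:6.2f} GiB, hybrid {hybrid / GIB:6.2f} GiB")
```

Under these assumed numbers the dense cache grows linearly with context (1 GiB at 8,192 tokens, 16 GiB at 131,072), while the hybrid's state space portion stays fixed and only its few attention layers scale with context.
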
## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms/transcription-requests` - Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md` - Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
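
The two entries under Actions can be exercised with a small client. The following is a minimal sketch using only Python's standard library. The URLs and HTTP methods come from the Actions list; the empty POST body, the helper names (`read_markdown_request`, `request_transcript_request`, `send`), and the timeout are assumptions, and the response format is not documented on this page, so the sketch returns it as raw text.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST = "the-stack-overflow-podcast"
EPISODE = "even-the-chip-makers-are-making-llms"


def read_markdown_request() -> urllib.request.Request:
    """Build the GET request for the Markdown representation (read_markdown)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def request_transcript_request() -> urllib.request.Request:
    """Build the POST request that asks for transcription (request_transcript).

    Assumption: the endpoint accepts an empty body; the page does not specify one.
    """
    url = f"{BASE}/v1/public/podcasts/{PODCAST}/episodes/{EPISODE}/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def send(req: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, body as text)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


# Inspect the prepared requests without touching the network.
for req in (read_markdown_request(), request_transcript_request()):
    print(req.get_method(), req.full_url)
```

Calling `send(request_transcript_request())` performs the explicit transcription request that a page view alone does not trigger. Because the action is described as idempotent, repeating the call should be safe.
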