Episode

Even the chip makers are making LLMs

Podcast: The Stack Overflow Podcast
Published: Mar 10, 2026
Duration seconds: 1613
Processing state: processed
Canonical source: https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
Audio: https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
JSON: /v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms
Markdown: /podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md

Actions

POST https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md
Read the agent-friendly Markdown representation of this episode resource.
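
The two actions above can be sketched as plain HTTP requests. This is a minimal sketch using Python's standard library, not a documented client: the page does not state whether the endpoints need authentication, headers, or a request body, so the empty POST body here is an assumption, and the helper names (transcription_request, markdown_request, send) are hypothetical.

```python
from urllib.request import Request, urlopen

# URLs copied verbatim from the Actions list above.
TRANSCRIPTION_URL = (
    "https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/"
    "episodes/even-the-chip-makers-are-making-llms/transcription-requests"
)
MARKDOWN_URL = (
    "https://stenobird.com/podcast/the-stack-overflow-podcast/"
    "even-the-chip-makers-are-making-llms.md"
)


def transcription_request() -> Request:
    """Build the POST that requests low-priority transcript generation.

    Assumption: the endpoint accepts an empty body and no auth headers.
    """
    return Request(TRANSCRIPTION_URL, data=b"", method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation of the episode."""
    return Request(MARKDOWN_URL, method="GET")


def send(req: Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Because the POST is described as idempotent, calling send(transcription_request()) more than once should be safe, though the response shape is not documented here.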

Summary

NVIDIA is moving beyond chip manufacturing to develop its own large language models through a hardware-software co-design loop. This strategy uses real-world model workloads to optimize next-generation GPU architectures and memory management.

Topics

NVIDIA
Large Language Models
Generative AI
GPU Architecture
Nemotron
Transformer Models
State Space Models
Hardware-Software Co-design
Machine Learning Inference

Highlights

Main idea: NVIDIA employs a 'full stack' approach, using model development to inform hardware architecture and networking
Practical takeaway: Hybrid architectures, such as combining Transformers with Mamba state space models, can significantly improve token efficiency
Failure mode: Relying solely on dense Transformers can lead to unsustainable inference costs as context length increases
Main idea: The Nemotron family represents a push toward open-source models with transparent training data and recipes
Practical takeaway: Effective generative AI requires optimizing the entire system, including memory hierarchies and disaggregated serving
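
The failure mode and the hybrid-architecture takeaway above can be made concrete with a rough memory estimate. One contributor to inference cost in a dense Transformer is the key/value cache, which grows linearly with context length, while a state space layer keeps a fixed-size recurrent state. The sketch below is a back-of-the-envelope calculation under stated assumptions: every model dimension is a hypothetical placeholder rather than a real Nemotron configuration, and it ignores weights, activations, and the small convolution state that Mamba-style layers also carry.

```python
"""Rough per-sequence inference memory: dense Transformer vs. a hybrid stack.

All dimensions used here are hypothetical placeholders for illustration.
"""

BYTES_FP16 = 2  # assumed storage precision for cached values


def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim,
                   bytes_per_value=BYTES_FP16):
    """KV cache size: one key and one value vector per token, head, and layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_value=BYTES_FP16):
    """Recurrent state size for state space layers; independent of seq_len."""
    return n_ssm_layers * d_inner * d_state * bytes_per_value


def hybrid_bytes(seq_len, n_attn_layers, n_ssm_layers, n_kv_heads, head_dim,
                 d_inner, d_state):
    """Hybrid stack: a few attention layers plus fixed-size SSM state."""
    return (kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim)
            + ssm_state_bytes(n_ssm_layers, d_inner, d_state))


if __name__ == "__main__":
    gib = 2 ** 30
    for seq_len in (4_096, 32_768, 262_144):
        # Hypothetical 48-layer dense model vs. 6 attention + 42 SSM layers.
        dense = kv_cache_bytes(seq_len, 48, 8, 128)
        hybrid = hybrid_bytes(seq_len, 6, 42, 8, 128, 8192, 128)
        print(f"{seq_len:>8} tokens: dense {dense / gib:.2f} GiB, "
              f"hybrid {hybrid / gib:.2f} GiB")
```

In this toy model the dense cache scales with every layer, whereas the hybrid stack pays the per-token cost only on its attention layers, which is the intuition behind the token-efficiency claim.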

Chapters

1:00 The Full Stack Vision: NVIDIA's transition from a chipmaker to a full-stack company involved in model development.
2:55 Hardware-Software Co-design: How understanding application workloads like deep learning drives the evolution of CUDA and GPU architecture.
4:50 Optimizing for Performance: Sharing architectural recipes for Blackwell and Hopper to improve memory, performance, and scalability.
6:40 Extreme Co-design: Matching model requirements to hardware constraints, specifically regarding memory and form factor.
8:35 Disaggregated Serving: Innovations in inference, including splitting pre-fill and decode tasks across different GPUs.
10:30 Hybrid Model Architectures: The development of Nemotron using a combination of Mamba state space models and Transformers.
12:30 The Future of Architectures: Exploring the potential of diffusion models and purpose-built innovations for token efficiency.
14:25 Systems of Models: Moving from single models to complex, interconnected systems of models and specialized memory management.