Episode
Even the chip makers are making LLMs
- Podcast: The Stack Overflow Podcast
- Published: Mar 10, 2026
- Duration seconds: 1613
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md
Read the agent-friendly Markdown representation of this episode resource.
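The two actions above can be sketched as plain HTTP requests. This is a minimal illustration using only the Python standard library; it builds the request objects without sending them. The page does not say whether the POST needs a body, headers, or authentication, so the empty-body request here is an assumption, and the helper name `build_requests` is hypothetical.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
SLUG = "the-stack-overflow-podcast"
EPISODE = "even-the-chip-makers-are-making-llms"

def build_requests():
    """Build (but do not send) the two requests listed under Actions."""
    # Assumption: the transcription request takes no body or auth header.
    post = Request(
        f"{BASE}/v1/public/podcasts/{SLUG}/episodes/{EPISODE}/transcription-requests",
        method="POST",
    )
    # Markdown representation of the episode resource.
    get = Request(f"{BASE}/podcast/{SLUG}/{EPISODE}.md", method="GET")
    return post, get

post_req, get_req = build_requests()
print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

To actually issue a request, pass one of these objects to `urllib.request.urlopen`. Because the POST is described as idempotent, repeating it should be safe.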
Summary
NVIDIA is moving beyond chip manufacturing to develop its own large language models through a hardware-software co-design loop. This strategy uses real-world model workloads to optimize next-generation GPU architectures and memory management.
Topics
- NVIDIA
- Large Language Models
- Generative AI
- GPU Architecture
- Nemotron
- Transformer Models
- State Space Models
- Hardware-Software Co-design
- Machine Learning Inference
Highlights
- Main idea: NVIDIA employs a 'full stack' approach, using model development to inform hardware architecture and networking
- Practical takeaway: Hybrid architectures, such as combining Transformers with Mamba state space models, can significantly improve token efficiency
- Failure mode: Relying solely on dense Transformers can lead to unsustainable inference costs as context length increases
- Main idea: The Nemotron family represents a push toward open-source models with transparent training data and recipes
- Practical takeaway: Effective generative AI requires optimizing the entire system, including memory hierarchies and disaggregated serving
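One way to make the dense-Transformer failure mode above concrete is to compare how per-sequence memory scales with context length. The sketch below is back-of-the-envelope arithmetic, not a description of Nemotron: every model dimension (layer count, head count, state size, 2-byte values) is a hypothetical default chosen for illustration, and the state space estimate is simplified to a single fixed-size state per layer.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Key/value cache for one sequence in a dense Transformer.

    Each layer keeps one key and one value vector per KV head for every
    token, so the total grows linearly with context length.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len

def ssm_state_bytes(n_layers=32, d_inner=8192, d_state=16, bytes_per_value=2):
    """Recurrent state for a stack of Mamba-style layers (simplified).

    The state has a fixed size, so it does not depend on context length.
    """
    return n_layers * d_inner * d_state * bytes_per_value

def hybrid_bytes(context_len, n_attention_layers=4, n_ssm_layers=28):
    """Hypothetical hybrid: a few attention layers plus many SSM layers."""
    return (kv_cache_bytes(context_len, n_layers=n_attention_layers)
            + ssm_state_bytes(n_layers=n_ssm_layers))

for context_len in (4_096, 131_072):
    dense = kv_cache_bytes(context_len) / 2**20
    hybrid = hybrid_bytes(context_len) / 2**20
    print(f"{context_len:>7} tokens: dense {dense:8.0f} MiB, hybrid {hybrid:8.0f} MiB")
```

With these assumed dimensions the dense cache costs 128 KiB per token, which comes to 512 MiB at 4,096 tokens and 16 GiB at 131,072 tokens, while the hybrid stays near 2 GiB at the longer length. The actual ratio depends on the real architecture, which the summary does not specify.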
Chapters
- 1:00 The Full Stack Vision: NVIDIA's transition from a chipmaker to a full-stack company involved in model development.
- 2:55 Hardware-Software Co-design: How understanding application workloads like deep learning drives the evolution of CUDA and GPU architecture.
- 4:50 Optimizing for Performance: Sharing architectural recipes for Blackwell and Hopper to improve memory, performance, and scalability.
- 6:40 Extreme Co-design: Matching model requirements to hardware constraints, specifically regarding memory and form factor.
- 8:35 Disaggregated Serving: Innovations in inference, including splitting pre-fill and decode tasks across different GPUs.
- 10:30 Hybrid Model Architectures: The development of Nemotron using a combination of Mamba state space models and Transformers.
- 12:30 The Future of Architectures: Exploring the potential of diffusion models and purpose-built innovations for token efficiency.
- 14:25 Systems of Models: Moving from single models to complex, interconnected systems of models and specialized memory management.
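The disaggregated serving idea from the 8:35 chapter, splitting pre-fill and decode across different GPUs, can be sketched as two workers that hand off a cache. This is a toy illustration only: the class names, device labels, and the "next token" rule are all hypothetical, and no real model or GPU is involved. In real systems pre-fill is typically compute-bound and decode is typically memory-bandwidth-bound, which is the usual motivation for placing them on separate hardware.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for per-token key/value tensors: just the token ids so far."""
    tokens: list = field(default_factory=list)

class PrefillWorker:
    """Processes the whole prompt in one pass and produces the cache."""
    def __init__(self, device):
        self.device = device  # hypothetical device label

    def run(self, prompt_tokens):
        return KVCache(tokens=list(prompt_tokens))

class DecodeWorker:
    """Generates one token per step from a cache transferred to it."""
    def __init__(self, device, vocab_size=100):
        self.device = device  # hypothetical device label
        self.vocab_size = vocab_size

    def run(self, cache, max_new_tokens):
        generated = []
        for _ in range(max_new_tokens):
            # Toy rule standing in for a model forward pass.
            next_token = sum(cache.tokens) % self.vocab_size
            cache.tokens.append(next_token)
            generated.append(next_token)
        return generated

def serve(prompt_tokens, max_new_tokens):
    prefill = PrefillWorker(device="gpu:0")
    decode = DecodeWorker(device="gpu:1")
    cache = prefill.run(prompt_tokens)         # phase 1 on the first device
    return decode.run(cache, max_new_tokens)   # phase 2 on the second device

print(serve([1, 2, 3], 3))  # prints [6, 12, 24]
```

The hand-off of `cache` between the two workers is where a real deployment would move key/value tensors over the interconnect, so the cost of that transfer is part of the design trade-off.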