Episode

Even the chip makers are making LLMs

Podcast
The Stack Overflow Podcast
Published
Mar 10, 2026
Duration seconds
1613
Processing state
processed
Canonical source
https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
Audio
https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
JSON
/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms
Markdown
/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

NVIDIA is moving beyond chip design to develop its own large language models as part of a hardware-software co-design loop. Real-world model workloads inform the design of next-generation GPU architectures and memory management.

Topics

  • NVIDIA
  • Large Language Models
  • Generative AI
  • GPU Architecture
  • Nemotron
  • Transformer Models
  • State Space Models
  • Hardware-Software Co-design
  • Machine Learning Inference

Highlights

  • Main idea: NVIDIA employs a full-stack approach, using model development to inform hardware architecture and networking
  • Practical takeaway: Hybrid architectures, such as combining Transformers with Mamba state space models, can significantly improve token efficiency
  • Failure mode: Relying solely on dense Transformers can lead to unsustainable inference costs as context length increases
  • Main idea: The Nemotron family represents a push toward open-source models with transparent training data and recipes
  • Practical takeaway: Effective generative AI requires optimizing the entire system, including memory hierarchies and disaggregated serving
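
The inference-cost point above can be illustrated with back-of-the-envelope memory arithmetic. The sketch below is not taken from the episode. It uses the standard key/value cache size estimate for attention layers (two tensors per layer, per token) and assumes a fixed-size recurrent state for each state space layer. All layer counts, head sizes, and the state size are hypothetical placeholders, not Nemotron's actual configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache size for the attention layers.

    Keys and values (the factor of 2) are stored for every token in every
    attention layer, so this grows linearly with seq_len.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Per-sequence state size for the state space layers.

    Assumption: each layer keeps a fixed-size state, independent of how many
    tokens have been processed.
    """
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


def hybrid_cache_bytes(seq_len, n_attn_layers, n_ssm_layers, n_kv_heads,
                       head_dim, state_elems_per_layer, bytes_per_elem=2):
    """Total per-sequence memory for a hybrid attention + state space stack."""
    return (kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem)
            + ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem))


if __name__ == "__main__":
    # Hypothetical 32-layer models: one all-attention, one with 4 attention
    # layers and 28 state space layers.
    for seq_len in (1_024, 32_768, 131_072):
        dense = kv_cache_bytes(seq_len, 32, 8, 128)
        hybrid = hybrid_cache_bytes(seq_len, 4, 28, 8, 128, 2**20)
        print(f"{seq_len:>7} tokens  dense={dense / 2**30:.2f} GiB  "
              f"hybrid={hybrid / 2**30:.2f} GiB")
```

Under these assumptions the dense model's cache scales with every layer as context grows, while in the hybrid only the few attention layers contribute a growing term.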

Chapters

  1. 1:00 The Full Stack Vision: NVIDIA's transition from a chipmaker to a full-stack company involved in model development.
  2. 2:55 Hardware-Software Co-design: How understanding application workloads like deep learning drives the evolution of CUDA and GPU architecture.
  3. 4:50 Optimizing for Performance: Sharing architectural recipes for Blackwell and Hopper to improve memory, performance, and scalability.
  4. 6:40 Extreme Co-design: Matching model requirements to hardware constraints, specifically regarding memory and form factor.
  5. 8:35 Disaggregated Serving: Innovations in inference, including splitting prefill and decode tasks across different GPUs.
  6. 10:30 Hybrid Model Architectures: The development of Nemotron using a combination of Mamba state space models and Transformers.
  7. 12:30 The Future of Architectures: Exploring the potential of diffusion models and purpose-built innovations for token efficiency.
  8. 14:25 Systems of Models: Moving from single models to complex, interconnected systems of models and specialized memory management.
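
The disaggregated serving idea from chapter 5 can be sketched as a toy pipeline. This is an illustration only, not NVIDIA's implementation: the class names, the `serve` function, and the fake "next token" rule are all hypothetical, and the cache is a plain list standing in for per-token key/value tensors. The point it shows is the split itself: one worker processes the whole prompt and builds the cache, then hands that cache to a different worker that generates tokens one at a time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    # Stand-in for real key/value tensors: just the token ids seen so far.
    tokens: List[int] = field(default_factory=list)


class PrefillWorker:
    """Hypothetical worker that handles the prompt in a single pass."""

    def __init__(self, name: str):
        self.name = name

    def prefill(self, prompt: List[int]) -> KVCache:
        # Copy the prompt so the caller's list is never mutated.
        return KVCache(tokens=list(prompt))


class DecodeWorker:
    """Hypothetical worker that generates tokens one step at a time."""

    def __init__(self, name: str, vocab_size: int = 100):
        self.name = name
        self.vocab_size = vocab_size

    def decode(self, cache: KVCache, max_new_tokens: int) -> List[int]:
        generated = []
        for _ in range(max_new_tokens):
            # Toy stand-in for a model forward pass over the cached context.
            next_token = (sum(cache.tokens) + len(cache.tokens)) % self.vocab_size
            cache.tokens.append(next_token)  # the cache grows by one entry per step
            generated.append(next_token)
        return generated


def serve(prompt: List[int], prefill_worker: PrefillWorker,
          decode_worker: DecodeWorker, max_new_tokens: int = 4) -> List[int]:
    # Stage 1: build the cache on the prefill worker.
    cache = prefill_worker.prefill(prompt)
    # Stage 2: hand the cache to the decode worker. In a real deployment this
    # would be a transfer between GPUs; here it is an ordinary function call.
    return decode_worker.decode(cache, max_new_tokens)


if __name__ == "__main__":
    output = serve([1, 2, 3], PrefillWorker("gpu-0"), DecodeWorker("gpu-1"), 3)
    print(output)
```

Separating the two stages lets each pool of workers be sized and scheduled for its own workload, at the cost of moving the cache between them.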