# Even the chip makers are making LLMs

Page: https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms
Text version: https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md
Podcast: [The Stack Overflow Podcast](https://stenobird.com/podcast/the-stack-overflow-podcast)
Published: 2026-03-10T04:00:00+00:00
Episode link: https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
Audio file: https://rss.art19.com/episodes/18719303-bfa2-4ee9-adab-1b99e7e31740.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms
Duration seconds: 1613

## Resource

NVIDIA is moving beyond chip manufacturing to develop its own large language models through a hardware-software co-design loop. This strategy uses real-world model workloads to optimize next-generation GPU architectures and memory management.

## Highlights
- Main idea: NVIDIA takes a full-stack approach, using model development to inform hardware architecture and networking
- Practical takeaway: Hybrid architectures, such as combining Transformers with Mamba state space models, can significantly improve token efficiency
- Failure mode: Relying solely on dense Transformers can make inference costs unsustainable as context length increases, because attention compute and key-value cache memory both grow with the number of tokens in context
- Main idea: The Nemotron family represents a push toward open-source models with transparent training data and recipes
- Practical takeaway: Effective generative AI requires optimizing the entire system, including memory hierarchies and disaggregated serving
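The failure mode and the hybrid-architecture takeaway above can be made concrete with rough memory arithmetic. The sketch below is illustrative only: the layer counts, head sizes, and per-layer state size are hypothetical values, not Nemotron's actual configuration, and the helper names are invented for this example. It assumes a standard Transformer key-value cache (one key and one value vector per token, per layer, per KV head) and a state space layer whose recurrent state has a fixed size regardless of context length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes 16-bit (fp16/bf16) cache entries


def kv_cache_bytes(shape: ModelShape, context_tokens: int) -> int:
    """Per-sequence KV cache size for a Transformer with `shape.layers` attention layers.

    The leading factor of 2 accounts for storing both keys and values.
    The result grows linearly with the number of tokens in context.
    """
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * context_tokens


def hybrid_cache_bytes(
    shape: ModelShape,
    context_tokens: int,
    attention_layers: int,
    state_bytes_per_ssm_layer: int,
) -> int:
    """Per-sequence cache size when only some layers use attention.

    The remaining layers are treated as state space layers that each keep a
    fixed-size state (an assumed, hypothetical size) independent of context.
    """
    attention_only = ModelShape(
        layers=attention_layers,
        kv_heads=shape.kv_heads,
        head_dim=shape.head_dim,
        bytes_per_value=shape.bytes_per_value,
    )
    ssm_layers = shape.layers - attention_layers
    return kv_cache_bytes(attention_only, context_tokens) + ssm_layers * state_bytes_per_ssm_layer


shape = ModelShape(layers=32, kv_heads=8, head_dim=128)  # hypothetical sizes
for tokens in (8_000, 128_000, 1_000_000):
    dense = kv_cache_bytes(shape, tokens)
    hybrid = hybrid_cache_bytes(shape, tokens, attention_layers=4, state_bytes_per_ssm_layer=1_000_000)
    print(f"{tokens:>9} tokens: all-attention {dense / 1e9:.2f} GB, hybrid {hybrid / 1e9:.2f} GB")
```

Under these assumptions the all-attention cache scales with every layer, while the hybrid cache scales only with its few attention layers plus a constant term. This covers memory only; it says nothing about model quality or about attention compute, which the episode summary does not quantify.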

## Topics

NVIDIA, Large Language Models, Generative AI, GPU Architecture, Nemotron, Transformer Models, State Space Models, Hardware-Software Co-design, Machine Learning Inference

## Chapters
- 1:00 — The Full Stack Vision: NVIDIA's transition from a chipmaker to a full-stack company involved in model development.
- 2:55 — Hardware-Software Co-design: How understanding application workloads like deep learning drives the evolution of CUDA and GPU architecture.
- 4:50 — Optimizing for Performance: Sharing architectural recipes for Blackwell and Hopper to improve memory, performance, and scalability.
- 6:40 — Extreme Co-design: Matching model requirements to hardware constraints, specifically regarding memory and form factor.
- 8:35 — Disaggregated Serving: Innovations in inference, including splitting pre-fill and decode tasks across different GPUs.
- 10:30 — Hybrid Model Architectures: The development of Nemotron using a combination of Mamba state space models and Transformers.
- 12:30 — The Future of Architectures: Exploring the potential of diffusion models and purpose-built innovations for token efficiency.
- 14:25 — Systems of Models: Moving from single models to complex, interconnected systems of models and specialized memory management.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/even-the-chip-makers-are-making-llms/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
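As a minimal sketch of how an agent might call the two actions above, the following uses only Python's standard library. The URLs and HTTP methods come from the action list. Everything else is an assumption: the empty POST body, the absence of authentication headers, and the function names are hypothetical, and the response format is not documented on this page.

```python
import urllib.request

BASE_URL = "https://stenobird.com"
EPISODE_PAGE_PATH = "/podcast/the-stack-overflow-podcast/even-the-chip-makers-are-making-llms"
EPISODE_API_PATH = (
    "/v1/public/podcasts/the-stack-overflow-podcast"
    "/episodes/even-the-chip-makers-are-making-llms"
)


def build_read_markdown() -> urllib.request.Request:
    """Build the GET request for the Markdown representation of this episode."""
    return urllib.request.Request(f"{BASE_URL}{EPISODE_PAGE_PATH}.md", method="GET")


def build_request_transcript() -> urllib.request.Request:
    """Build the POST request that asks for transcript generation.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    The page describes the action as idempotent, so repeating it should be safe.
    """
    return urllib.request.Request(
        f"{BASE_URL}{EPISODE_API_PATH}/transcription-requests",
        data=b"",
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Example usage (performs real network calls, so it is left commented out):
# status, body = send(build_request_transcript())
# status, markdown = send(build_read_markdown())
```

Keeping request construction separate from sending makes the explicit step visible: fetching the page with `build_read_markdown` never triggers transcription, and only `build_request_transcript` does.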

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.