# The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

- Page: https://stenobird.com/podcast/twiml-ai-podcast/the-race-to-production-grade-diffusion-llms-with-stefano-ermon-764
- Text version: https://stenobird.com/podcast/twiml-ai-podcast/the-race-to-production-grade-diffusion-llms-with-stefano-ermon-764.md
- Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
- Published: 2026-03-26T22:35:00+00:00
- Episode link: https://twimlai.com/podcast/twimlai/race-production-grade-diffusion-llms
- Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN4110108991.mp3?updated=1774564986
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/the-race-to-production-grade-diffusion-llms-with-stefano-ermon-764
- Duration seconds: 3798

## Resource

Diffusion models, traditionally used for images, are being adapted for text and code to enable faster, more efficient generation. This episode explores how these models can achieve 5-10x faster inference speeds and lower costs compared to traditional autoregressive LLMs.
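As a rough illustration of why refining many positions at once can need fewer model calls than one-token-at-a-time decoding, here is a toy sketch of a masking-style noise process and its reversal. It is not the method discussed in the episode: `MASK`, `mask_tokens`, `denoise`, and the oracle `predict` function are all hypothetical names, and the "model" simply returns the known answer so the example runs without a trained network.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol for a hidden token


def mask_tokens(tokens, mask_rate, rng):
    """Forward (noising) process: hide each token independently with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]


def denoise(noisy, predict, steps):
    """Reverse process: fill masked positions over at most `steps` model calls.

    `predict(seq)` stands in for a trained model and must return a guess
    for every position. Each call commits an equal share of the remaining
    masked positions, so several tokens are produced per call.
    """
    seq = list(noisy)
    calls = 0
    for step in range(steps):
        holes = [i for i, tok in enumerate(seq) if tok == MASK]
        if not holes:
            break
        guesses = predict(seq)  # one call yields guesses for all positions
        calls += 1
        share = -(-len(holes) // (steps - step))  # ceiling division
        for i in holes[:share]:
            seq[i] = guesses[i]
    return seq, calls


# Demo with an oracle "model" that already knows the answer (an assumption
# made only so the sketch runs without any trained network).
target = "def add ( a , b ) :".split()
noisy = mask_tokens(target, mask_rate=1.0, rng=random.Random(0))
restored, calls = denoise(noisy, predict=lambda seq: target, steps=2)
```

In this toy run the eight tokens are restored in two calls, where strictly left-to-right decoding would need eight. A real system would presumably choose which positions to commit from the model's own predictions rather than by list order, and its actual speedup depends on details this page does not give.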
## Highlights

- Main idea: Diffusion models offer a high-performance alternative to autoregressive LLMs by generating tokens through iterative refinement rather than sequential prediction
- Practical takeaway: Diffusion-based text generation can achieve 5-10x faster inference, making it ideal for latency-sensitive applications like voice AI
- Technical challenge: Adapting continuous diffusion methods to discrete token spaces requires novel approaches like masking or embedding-space diffusion
- Efficiency gain: Because diffusion models can generate multiple tokens simultaneously, they provide higher throughput and lower cost per GPU
- Future outlook: The convergence of image, video, and text into unified multimodal diffusion architectures remains a major frontier in generative AI

## Topics

Diffusion Models, Large Language Models, Generative AI, Inference Optimization, Multimodal AI, Machine Learning Architecture, Token Generation, Neural Networks

## Chapters

- 1:05 — The Economic Case for Diffusion: Discussion on why the current era is ripe for diffusion models due to their superior serving efficiency and lower cost per token.
- 6:00 — From Pixels to Text: A look at the fundamental shift from generating images via noise refinement to applying similar principles to language.
- 10:55 — Diffusion in Embedding Space: Exploring technical methods for implementing diffusion models within continuous embedding spaces to handle text.
- 15:40 — Token Masking and Noise Processes: An analysis of discrete noise processes, specifically using masking techniques to train models to predict missing tokens.
- 20:15 — Reasoning and Inference Scaling: Investigating whether diffusion models can implement 'thinking traces' or adjustable denoising steps to simulate reasoning.
- 25:15 — The Advantage of Throughput: How the ability to generate more tokens per GPU makes diffusion models a viable competitor for production-grade AI.
- 34:50 — API Integration and User Experience: How diffusion models can be integrated into existing workflows using familiar parameters like reasoning effort.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/the-race-to-production-grade-diffusion-llms-with-stefano-ermon-764/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/the-race-to-production-grade-diffusion-llms-with-stefano-ermon-764.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
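The two entries under Actions could be called from a script roughly as follows. This is a minimal sketch using only Python's standard library: the URLs and HTTP methods come from this page, while the empty POST body, the absence of authentication headers, and the helper names (`build_transcript_request`, `build_markdown_request`, `send`) are assumptions, since the page does not document request or response formats.

```python
import urllib.request

BASE = "https://stenobird.com"
EPISODE = "the-race-to-production-grade-diffusion-llms-with-stefano-ermon-764"


def build_transcript_request():
    """request_transcript: POST to the episode's transcription-requests URL."""
    url = (
        f"{BASE}/v1/public/podcasts/twiml-ai-podcast/episodes/"
        f"{EPISODE}/transcription-requests"
    )
    # Assumption: no request body is required, so an empty one is sent.
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request():
    """read_markdown: GET the Markdown representation of this episode."""
    url = f"{BASE}/podcast/twiml-ai-podcast/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(request, timeout=10):
    """Perform the request and return (status code, body text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Building the requests is kept separate from sending them so the URLs can be inspected without network access. Because the page describes `request_transcript` as idempotent, calling `send(build_transcript_request())` more than once should not queue duplicate work, though the response format is not specified here.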