# Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750

- Page: https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750
- Text version: https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750.md
- Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
- Published: 2025-10-07T17:37:00+00:00
- Episode link: https://twimlai.com/podcast/twimlai/recurrence-and-attention-for-long-context-transformers/
- Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7068202936.mp3?updated=1759858524
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750
- Duration seconds: 3443

## Resource

The Power Retention architecture solves the scaling bottleneck of long-context transformers by blending the parallelization of attention with the linear scaling of recurrence. This approach achieves massive speedups (over 10x during training and 100x during inference) without sacrificing context utility.
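The scaling contrast between attention and recurrence can be illustrated with a back-of-the-envelope cost model. The sketch below is not Power Retention's actual cost model: the function names, the constant factors, and the `state_size` parameter are all illustrative assumptions. It only shows why attending over a growing cache costs quadratic total work in sequence length, while updating a fixed-size recurrent state costs linear total work.

```python
def attention_total_flops(seq_len, n_layers, d_model):
    """Rough FLOPs spent on attention over a whole sequence (assumed model).

    Token t attends over t cached keys/values. The QK^T and AV products
    each cost about 2 * d_model * t FLOPs, so about 4 * d_model * t per layer.
    Summing t = 1..seq_len gives 2 * d_model * seq_len * (seq_len + 1).
    """
    return n_layers * 2 * d_model * seq_len * (seq_len + 1)


def recurrent_total_flops(seq_len, n_layers, state_size):
    """Rough FLOPs for a fixed-size recurrent state (assumed model).

    Each token updates and reads every state element once, counted here
    as about 4 FLOPs per element, independent of how many tokens came before.
    """
    return n_layers * seq_len * 4 * state_size


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    n_layers, d_model, state_size = 32, 4096, 65536
    for seq_len in (1_024, 8_192, 65_536):
        attn = attention_total_flops(seq_len, n_layers, d_model)
        rec = recurrent_total_flops(seq_len, n_layers, state_size)
        print(f"{seq_len:>6} tokens: attention/recurrent FLOP ratio = {attn / rec:.2f}")
```

Under these assumptions, doubling the sequence length roughly quadruples the attention cost but only doubles the recurrent cost, which is the gap that grows into large speedups at long context.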
## Highlights

- Main idea: Achieving long context requires balancing the weight-state FLOP ratio to ensure compute-optimal architectures
- Practical takeaway: Use the PowerCoder 3B model to experiment with instruction fine-tuning and long-context performance
- Failure mode: Windowed attention models often fail to utilize their full effective context, hitting a performance knee much earlier than expected
- Technical insight: Power Retention allows for a 'metamorphosis' of existing models like Qwen to gain massive efficiency in long-context tasks
- Efficiency metric: The architecture aims for a balanced ratio between parameter-based calculations (weight FLOPs) and state-based calculations (state FLOPs)

## Topics

Transformers, Long-Context AI, Power Retention Architecture, Machine Learning Scaling Laws, GPU Optimization, Recurrence, Attention Mechanisms, Deep Learning Inference

## Chapters

- 1:00 — Introduction to Long-Context Challenges: Jacob Buckman introduces the fundamental bottleneck in scaling AI: while weights and datasets scale well, context length remains a critical technical hurdle.
- 5:25 — Measuring Context Utility: A discussion on the limitations of standard metrics like 'needle in a haystack' and the need for more robust ways to demonstrate long-context utility.
- 22:40 — The Weight-State FLOP Ratio: An exploration of compute optimality through the lens of balancing parameter-based FLOPs against state-based FLOPs.
- 31:05 — Architectural Imbalance: Why architectures with disproportionately large or small states are inefficient and how to use scaling laws to find the 'sweet spot'.
- 39:30 — Optimizing with CUDA and Triton: The role of custom CUDA kernels and high-level abstractions in enabling efficient searches through the architecture space.
- 48:10 — PowerCoder and Open Source Tools: An overview of Manifest AI's recent releases, including the PowerCoder 3B model and the Vidrial CUDA framework.
- 52:30 — Scaling Laws and Future Directions: Analyzing the independent effects of scaling factors and the potential for massive context expansion in future models.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
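For agents that want to call the two endpoints listed under Actions, the following is a minimal sketch using only Python's standard library. The URLs and HTTP methods come from this page; the helper names are hypothetical, and the empty POST body and the absence of authentication headers are assumptions, since the page does not document either. The functions only build the requests, so nothing is sent until `urllib.request.urlopen` is called on the result.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST_SLUG = "twiml-ai-podcast"
EPISODE_SLUG = (
    "recurrence-and-attention-for-long-context-transformers"
    "-with-jacob-buckman-750"
)


def build_read_markdown_request():
    """Build the GET request for the Markdown version of this episode page."""
    url = f"{BASE}/podcast/{PODCAST_SLUG}/{EPISODE_SLUG}.md"
    return urllib.request.Request(url, method="GET")


def build_transcript_request():
    """Build the POST request that asks for transcript generation.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST_SLUG}"
        f"/episodes/{EPISODE_SLUG}/transcription-requests"
    )
    return urllib.request.Request(url, data=b"", method="POST")


# To actually send a request (performs network I/O):
#     with urllib.request.urlopen(build_read_markdown_request()) as resp:
#         markdown_text = resp.read().decode("utf-8")
```

Because the page describes `request_transcript` as idempotent, repeating the POST should be safe, but the response format is not documented here and should be inspected before relying on it.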