# Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750

Page: https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750
Text version: https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750.md
Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
Published: 2025-10-07T17:37:00+00:00
Episode link: https://twimlai.com/podcast/twimlai/recurrence-and-attention-for-long-context-transformers/
Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7068202936.mp3?updated=1759858524
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750
Duration seconds: 3443

## Resource

The Power Retention architecture targets the scaling bottleneck of long-context transformers by combining the parallelizable training of attention with the linear-cost scaling of recurrence. The episode describes speedups of over 10x during training and 100x during inference, without sacrificing context utility.
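The attention/recurrence duality mentioned above can be illustrated with a minimal sketch. It uses plain unnormalized causal linear attention, which is a swapped-in textbook technique and not the Power Retention kernel itself; the function names are hypothetical. The point is only that the same outputs can be computed either attention-style (quadratic in sequence length) or through a fixed-size recurrent state (linear in sequence length).

```python
def attention_form(Q, K, V):
    """Causal, unnormalized linear attention computed attention-style.

    Every query at step t is scored against all keys s <= t, so the total
    cost grows quadratically with the sequence length.
    """
    out = []
    for t in range(len(Q)):
        o = [0.0] * len(V[0])
        for s in range(t + 1):
            score = sum(q * k for q, k in zip(Q[t], K[s]))
            for j, v in enumerate(V[s]):
                o[j] += score * v
        out.append(o)
    return out


def recurrent_form(Q, K, V):
    """Same outputs via a fixed-size state S = sum over s of k_s v_s^T.

    Each step updates S once and reads it once, so the total cost grows
    linearly with the sequence length.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Tiny illustrative inputs (3 time steps, key dim 2, value dim 2).
Q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
K = [[0.5, 1.0], [1.0, 1.0], [-1.0, 0.5]]
V = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

a = attention_form(Q, K, V)
r = recurrent_form(Q, K, V)
```

Both functions return the same list of output vectors for these inputs; the recurrent form simply never revisits earlier time steps.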

## Highlights
- Main idea: Achieving long context requires balancing the weight-state FLOP ratio to ensure compute-optimal architectures
- Practical takeaway: Use the PowerCoder 3B model to experiment with instruction fine-tuning and long-context performance
- Failure mode: Windowed attention models often fail to utilize their full effective context, hitting a performance knee much earlier than expected
- Technical insight: Power Retention allows for a 'metamorphosis' of existing models like Qwen to gain massive efficiency in long-context tasks
- Efficiency metric: The architecture aims for a balanced ratio between parameter-based calculations (weight FLOPs) and state-based calculations (state FLOPs)
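To make the weight-state FLOP ratio concrete, here is a rough back-of-the-envelope sketch for a standard softmax-attention transformer. It is not the formulation used in the episode: it assumes about 2 FLOPs per parameter per token for weight FLOPs, and about 4 × layers × model width × context length per token for attention (state) FLOPs, counting one multiply-add as 2 FLOPs for both the query-key scores and the value sum. The model configuration is hypothetical, and treating a ratio of 1 as the balance point is likewise an assumption.

```python
def weight_flops_per_token(n_params):
    # Assumption: each parameter takes part in one multiply-add per token.
    return 2 * n_params


def attention_state_flops_per_token(n_layers, d_model, context_len):
    # Assumption: per layer, 2*d_model*context_len FLOPs for query-key scores
    # plus 2*d_model*context_len FLOPs for the weighted sum of values.
    return 4 * n_layers * d_model * context_len


def weight_state_ratio(n_params, n_layers, d_model, context_len):
    return weight_flops_per_token(n_params) / attention_state_flops_per_token(
        n_layers, d_model, context_len
    )


def balanced_context(n_params, n_layers, d_model):
    # Context length at which weight FLOPs equal state FLOPs (ratio of 1).
    return weight_flops_per_token(n_params) / (4 * n_layers * d_model)


# Hypothetical configuration, not taken from any model named in this page.
N_PARAMS, N_LAYERS, D_MODEL = 1_000_000_000, 24, 2048

for ctx in (1_024, 16_384, 262_144):
    ratio = weight_state_ratio(N_PARAMS, N_LAYERS, D_MODEL, ctx)
    print(f"context={ctx:>7}  weight/state FLOP ratio={ratio:.3f}")

print("balance point (tokens):", round(balanced_context(N_PARAMS, N_LAYERS, D_MODEL)))
```

Under these assumptions the ratio falls in proportion to context length, so a model that is weight-dominated at short context becomes state-dominated at long context.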

## Topics

Transformers, Long-Context AI, Power Retention Architecture, Machine Learning Scaling Laws, GPU Optimization, Recurrence, Attention Mechanisms, Deep Learning Inference

## Chapters
- 1:00 — Introduction to Long-Context Challenges: Jacob Buckman introduces the fundamental bottleneck in scaling AI: while weights and datasets scale well, context length remains a critical technical hurdle.
- 5:25 — Measuring Context Utility: A discussion on the limitations of standard metrics like 'needle in a haystack' and the need for more robust ways to demonstrate long-context utility.
- 22:40 — The Weight-State FLOP Ratio: An exploration of compute optimality through the lens of balancing parameter-based FLOPs against state-based FLOPs.
- 31:05 — Architectural Imbalance: Why architectures with disproportionately large or small states are inefficient and how to use scaling laws to find the 'sweet spot'.
- 39:30 — Optimizing with CUDA and Triton: The role of custom CUDA kernels and high-level abstractions in enabling efficient searches through the architecture space.
- 48:10 — PowerCoder and Open Source Tools: An overview of Manifest AI's recent releases, including the PowerCoder 3B model and the Vidrial CUDA framework.
- 52:30 — Scaling Laws and Future Directions: Analyzing the independent effects of scaling factors and the potential for massive context expansion in future models.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
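The two actions above could be prepared from Python roughly as follows. This is a minimal sketch using only the standard library; the helper names are hypothetical, and sending an empty POST body with no extra headers is an assumption, since this page does not specify a request body or authentication. The requests are built but not sent.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST = "twiml-ai-podcast"
EPISODE = (
    "recurrence-and-attention-for-long-context-transformers-"
    "with-jacob-buckman-750"
)


def build_request_transcript():
    """Build (but do not send) the request_transcript POST."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}/episodes/{EPISODE}"
        "/transcription-requests"
    )
    # Assumption: an empty body is acceptable for this endpoint.
    return urllib.request.Request(url, data=b"", method="POST")


def build_read_markdown():
    """Build (but do not send) the read_markdown GET."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


post_req = build_request_transcript()
get_req = build_read_markdown()
print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

To actually issue either call, pass the request object to `urllib.request.urlopen`. Because `request_transcript` is described as idempotent, retrying the POST after a transient failure should be safe.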

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.