{"podcast":{"title":"The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)","slug":"twiml-ai-podcast","podcast_index_feed_id":1045879,"rss_url":"https://feeds.megaphone.fm/MLN2155636147","website_url":"https://twimlai.com","image_url":"https://megaphone.imgix.net/podcasts/35230150-ee98-11eb-ad1a-b38cbabcd053/image/TWIML_AI_Podcast_Official_Cover_Art_1400px.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress","author":"TWIML","episode_count":785,"summary":"Machine learning and artificial intelligence are dramatically changing the way businesses operate and people live. The TWIML AI Podcast brings the top minds and ideas from the world of ML and AI to a broad and influential community of ML/AI researchers, data scientists, engineers and tech-savvy business and IT leaders. Hosted by Sam Charrington, a sought after industry analyst, speaker, commentator and thought leader. Technologies covered include machine learning, artificial intelligence, deep learning, natural language processing, neural networks, analytics, computer science, data science and more.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/twiml-ai-podcast"},"episode":{"title":"Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750","slug":"recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750","published_at":"2025-10-07T17:37:00+00:00","page_url":"https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750","show_page_url":"https://stenobird.com/podcast/twiml-ai-podcast","url":"https://twimlai.com/podcast/twimlai/recurrence-and-attention-for-long-context-transformers/","audio_url":"https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7068202936.mp3?updated=1759858524","summary":"The Power Retention architecture solves the scaling bottleneck of long-context transformers by blending the parallelization of attention with the linear scaling of recurrence. This approach achieves massive speedups—over 10x during training and 100x during inference—without sacrificing context utility.","meta_description":"Explore the Power Retention architecture: a new way to achieve massive context lengths with 100x inference speedups using recurrence and attention.","key_points":["Main idea: Achieving long context requires balancing the weight-state FLOP ratio to ensure compute-optimal architectures","Practical takeaway: Use the PowerCoder 3B model to experiment with instruction fine-tuning and long-context performance","Failure mode: Windowed attention models often fail to utilize their full effective context, hitting a performance knee much earlier than expected","Technical insight: Power Retention allows for a 'metamorphosis' of existing models like Qwen to gain massive efficiency in long-context tasks","Efficiency metric: The architecture aims for a balanced ratio between parameter-based calculations (weight FLOPs) and state-based calculations (state FLOPs)"],"chapters":[{"start_ms":60000,"title":"Introduction to Long-Context Challenges","summary":"Jacob Buckman introduces the fundamental bottleneck in scaling AI: while weights and datasets scale well, context length remains a critical technical hurdle."},{"start_ms":325000,"title":"Measuring Context Utility","summary":"A discussion on the limitations of standard metrics like 'needle in a haystack' and the need for more robust ways to demonstrate long-context utility."},{"start_ms":1360000,"title":"The Weight-State FLOP Ratio","summary":"An exploration of compute optimality through the lens of balancing parameter-based FLOPs against state-based FLOPs."},{"start_ms":1865000,"title":"Architectural Imbalance","summary":"Why architectures with disproportionately large or small states are inefficient and how to use scaling laws to find the 'sweet spot'."},{"start_ms":2370000,"title":"Optimizing with CUDA and Triton","summary":"The role of custom CUDA kernels and high-level abstractions in enabling efficient searches through the architecture space."},{"start_ms":2890000,"title":"PowerCoder and Open Source Tools","summary":"An overview of Manifest AI's recent releases, including the PowerCoder 3B model and the Vidrial CUDA framework."},{"start_ms":3150000,"title":"Scaling Laws and Future Directions","summary":"Analyzing the independent effects of scaling factors and the potential for massive context expansion in future models."}],"topics":["Transformers","Long-Context AI","Power Retention Architecture","Machine Learning Scaling Laws","GPU Optimization","Recurrence","Attention Mechanisms","Deep Learning Inference"],"duration_seconds":3443,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/twiml-ai-podcast/recurrence-and-attention-for-long-context-transformers-with-jacob-buckman-750.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}