{"podcast":{"title":"Latent Space: The AI Engineer Podcast","slug":"latent-space-ai-engineer","podcast_index_feed_id":6058902,"rss_url":"https://api.substack.com/feed/podcast/1084089.rss","website_url":"https://www.latent.space/podcast","image_url":"https://substackcdn.com/feed/podcast/1084089/ca7468da5614a246d2906ee8926f6de7.jpg","author":"Latent.Space","episode_count":204,"summary":"The AI Engineer newsletter + Top technical AI podcast. How leading labs build Agents, Models, Infra, & AI for Science. See https://latent.space/about for highlights from Greg Brockman, Andrej Karpathy, George Hotz, Simon Willison, Soumith Chintala et al!","last_synced_at":null,"page_url":"https://stenobird.com/podcast/latent-space-ai-engineer"},"episode":{"title":"METR’s Joel Becker on exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity","slug":"metr-s-joel-becker-on-exponential-time-horizon-evals-threat-models-and-the-limits-of-ai-productivity","published_at":"2026-02-27T19:17:52+00:00","page_url":"https://stenobird.com/podcast/latent-space-ai-engineer/metr-s-joel-becker-on-exponential-time-horizon-evals-threat-models-and-the-limits-of-ai-productivity","show_page_url":"https://stenobird.com/podcast/latent-space-ai-engineer","url":"https://www.latent.space/p/metr","audio_url":"https://api.substack.com/feed/podcast/189159777/9da6fbe2f2d5b3d14c2227e41401719c.mp3","summary":"METR's Joel Becker critiques the reductionist nature of AI benchmarks, arguing that single-number metrics like 'Time Horizon' obscure critical nuances in model capabilities. The discussion explores the tension between rapid compute scaling and the complex, multi-dimensional requirements for true autonomous agency.","meta_description":"Explore the limits of AI benchmarks, the reality of compute-driven progress, and why single-number metrics fail to capture true model capabilities.","key_points":["Main idea: Single-number metrics like Time Horizon collapse essential nuances regarding what models can actually achieve in real-world scenarios","Failure mode: Over-reliance on benchmarks like SWE-bench can lead to inflated expectations by ignoring the 'low-value' nature of certain automated tasks","Practical takeaway: Evaluating AI safety requires looking beyond capabilities to 'propensities'—how models behave when deployed in the wild","Main idea: The relationship between compute growth and model progress is non-linear and subject to significant uncertainty as scaling laws encounter physical limits","Critical insight: True autonomous agency requires much more than code generation; it necessitates physical-world interaction, such as managing data center hardware"],"chapters":[{"start_ms":60000,"title":"Defining METR","summary":"An introduction to METR's mission, focusing on Model Evaluation (ME) and Threat Research (TR)."},{"start_ms":310000,"title":"Task Selection and Biases","summary":"Discussing the difficulty of picking economically valuable tasks and the biases inherent in evaluating autonomy."},{"start_ms":560000,"title":"The Limits of Summary Statistics","summary":"Why reducing model performance to a single number fails to capture the gap between machine and human capabilities."},{"start_ms":800000,"title":"Trendlines and Model Progress","summary":"Analyzing whether recent model releases validate or falsify existing projections for capability doubling times."},{"start_ms":1060000,"title":"The Illusion of Productivity","summary":"Examining why high benchmark scores can lead to overly optimistic and inflated expectations of AI utility."},{"start_ms":1310000,"title":"Independent Threat Evaluation","summary":"The importance of third-party, non-lab-funded research in monitoring AI risks and propensities."},{"start_ms":1835000,"title":"Compute, Scaling, and Progress","summary":"Investigating how slowing compute growth might impact the trajectory of AI capabilities."}],"topics":["AI Safety","Model Evaluation","Compute Scaling","Autonomous Agents","Threat Modeling","AI Benchmarks","Machine Learning","AI Productivity"],"duration_seconds":3374,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/metr-s-joel-becker-on-exponential-time-horizon-evals-threat-models-and-the-limits-of-ai-productivity/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/latent-space-ai-engineer/metr-s-joel-becker-on-exponential-time-horizon-evals-threat-models-and-the-limits-of-ai-productivity.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}