{"podcast":{"title":"Chain of Thought | AI Agents, Infrastructure & Engineering","slug":"chain-of-thought-ai-agents","podcast_index_feed_id":7074333,"rss_url":"https://feeds.transistor.fm/chain-of-thought","website_url":"https://newsletter.chainofthought.show/","image_url":"https://img.transistorcdn.com/Yf0B0akhfAtC-Ahrn6UylfgdusLiIKiLKXlMy29dfwI/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS81YjBh/ODMzMTY3ZjQ0MjBj/YTE1ODMwYTZlNDgx/Mjc2Mi5qcGc.jpg","author":"Conor Bronsdon","episode_count":63,"summary":"AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead. Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes weekly. Conor Bronsdon is an angel investor in AI and dev tools, Technical Ecosystem Lead at Modular, and previously led growth at AI startups Galileo and LinearB. Disclaimer: All views, opinions and statements expressed on this account are solely my own and are made in my personal capacity. They do not reflect, and should not be construed as reflecting, the views, positions, or policies of Modular. This account is not affiliated with, authorized by, or endorsed by Modular in any way.","last_synced_at":"2026-06-12T00:17:44.387836+00:00","page_url":"https://stenobird.com/podcast/chain-of-thought-ai-agents"},"episode":{"title":"Every AI Agent Has an Evaluation Gap | Alex Ratner, Snorkel AI","slug":"every-ai-agent-has-an-evaluation-gap-alex-ratner-snorkel-ai","published_at":"2026-04-29T11:58:48+00:00","page_url":"https://stenobird.com/podcast/chain-of-thought-ai-agents/every-ai-agent-has-an-evaluation-gap-alex-ratner-snorkel-ai","show_page_url":"https://stenobird.com/podcast/chain-of-thought-ai-agents","url":"https://share.transistor.fm/s/18593a4c","audio_url":"https://media.transistor.fm/18593a4c/f07fff44.mp3","summary":"The rapid advancement of AI agent capabilities has outpaced our ability to measure them, creating a dangerous 'evaluation gap' in enterprise applications. Alex Ratner explains why solving this requires moving beyond simple benchmarks toward a holistic integration of task, environment, and data.","meta_description":"Snorkel AI CEO Alex Ratner discusses the evaluation gap in AI agents, the importance of open benchmarks, and the shift toward data-centric AI development.","key_points":["Main idea: The 'evaluation gap' occurs because agent capabilities are advancing faster than the metrics used to verify their reliability in high-stakes enterprise settings","Failure mode: 'Benchmaxing'—the tendency for models to overfit to public benchmarks, which provides a false sense of capability without real-world utility","Practical takeaway: Effective agent development requires a holistic approach where the task, the environment, and the data are designed and evaluated together","Main idea: Data is shifting from an upstream preprocessing step to the central engine of AI development and model refinement","Practical takeaway: To move agents into production, companies must move beyond simple answer keys toward complex, use-case-specific private benchmarks"],"chapters":[{"start_ms":60000,"title":"The Origins of Data-Centric AI","summary":"Introduction to Alex Ratner and his work establishing the field of data-centric AI at Stanford and Snorkel AI."},{"start_ms":245000,"title":"The Enterprise Risk Profile","summary":"Discussing the high stakes of error in enterprise AI and how the 'jagged frontier' of capabilities creates unpredictable risks."},{"start_ms":435000,"title":"The Measurement Crisis","summary":"How the complexity of modern AI capabilities is making it increasingly difficult to create reliable measurement tools."},{"start_ms":625000,"title":"Building Specialized Benchmarks","summary":"A look at Snorkel's work with legal AI (Harvey) to create specialized benchmarks like Big Law Bench."},{"start_ms":1200000,"title":"The Danger of Benchmaxing","summary":"Addressing the backlash against public benchmarks and the risks of models overfitting to standardized tests."},{"start_ms":1580000,"title":"Data as the Epicenter of AI","summary":"Exploring the hypothesis that data, rather than model architecture, will become the primary driver of AI performance."},{"start_ms":2355000,"title":"The Integration of Environment and Data","summary":"Why environment vendors and data vendors must collaborate to create functional, real-world AI agents."}],"topics":["AI Agents","Evaluation Gap","Data-Centric AI","Machine Learning Benchmarks","Enterprise AI","Synthetic Data","Model Evaluation","Snorkel AI"],"duration_seconds":2560,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/chain-of-thought-ai-agents/episodes/every-ai-agent-has-an-evaluation-gap-alex-ratner-snorkel-ai/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/chain-of-thought-ai-agents/every-ai-agent-has-an-evaluation-gap-alex-ratner-snorkel-ai.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}