Episode
Proactive Agents for the Web with Devi Parikh - #756
- Published: Nov 19, 2025
- Duration: 3364 seconds
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/proactive-agents-for-the-web-with-devi-parikh-756/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/proactive-agents-for-the-web-with-devi-parikh-756.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
The future of web interaction lies in moving from manual clicking to high-level abstraction via proactive, autonomous agents. Devi Parikh explains how Yutori uses visually grounded models to navigate the web more reliably than traditional DOM-based approaches.
Topics
- Proactive Agents
- Web Automation
- Computer Vision
- Multimodal Models
- Browser Use Models
- Autonomous Agents
- Yutori
- AI Agents
Highlights
- Main idea: Moving from DOM-based parsing to vision-based models provides much higher robustness against brittle web interfaces
- Technical approach: Yutori utilizes a training pipeline involving supervised fine-tuning, rejection sampling, and reinforcement learning
- Practical takeaway: Using 'Scouts' allows for ambient, background automation that monitors the web and reports findings without active user input
- Failure mode: Traditional browser automation often breaks due to edge cases in website structures, necessitating a shift toward visual grounding
- Future vision: The goal is to transition from simple information monitoring to complex, multi-step task automation that operates autonomously
Chapters
- 1:00 The Evolution of Web Interaction: A look back at the progress in AI and the shift toward browser-use agents.
- 9:15 The Rise of Browser Agents: Discussing the excitement around automating web tasks and the potential for broader platforms.
- 22:05 Scaling Complex Workflows: How improving foundation models and custom training pipelines pushes the ceiling of agent capabilities.
- 29:40 Beyond Static Reports: Moving from simple data retrieval to interactive, actionable outputs from web agents.
- 37:40 The Shift to Vision-Based Navigation: Why relying on screenshots and visual grounding is more reliable than parsing the DOM.
- 46:25 Adaptive Orchestration: How 'Scouts' use adaptive plans and tool-use to execute complex, multi-step web tasks.
- 50:30 Ambient Agentic Systems: The concept of background agents that monitor the web 24/7 and notify users of significant events.