# From Web Video to Real-World Robots

- Page: https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/from-web-video-to-real-world-robots
- Text version: https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/from-web-video-to-real-world-robots.md
- Podcast: [The Data Exchange with Ben Lorica](https://stenobird.com/podcast/the-data-exchange-with-ben-lorica)
- Published: 2026-04-23T11:00:00+00:00
- Episode link: https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/19021333-from-web-video-to-real-world-robots.mp3
- Audio file: https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/19021333-from-web-video-to-real-world-robots.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/from-web-video-to-real-world-robots
- Duration seconds: 1867

## Resource

Rhoda AI is developing a vision-driven foundation model for robotics that decouples video prediction from action extraction. By pre-training on web-scale video, the model learns world dynamics, allowing robots to learn specific tasks with minimal physical interaction data.
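As a rough illustration of the decoupling described above, the toy sketch below trains a "video predictor" on action-free frame sequences and a separate "action extractor" on a much smaller labeled robot dataset, then chains the two into a policy. Everything in it is an assumption made for illustration: the 1-D "frames", the linear models, the function names, and the inverse-dynamics-style extractor. The episode summary does not say how Rhoda AI implements either stage.

```python
from typing import Callable, List, Tuple

Frame = float   # toy stand-in for an image: a 1-D object position (assumption)
Action = float  # toy stand-in for a robot command (assumption)


def fit_video_predictor(videos: List[List[Frame]]) -> Callable[[Frame], Frame]:
    """Stage 1 (no actions needed): least-squares fit of next = a * frame + b
    on consecutive frame pairs taken from plain video sequences."""
    pairs = [(v[i], v[i + 1]) for v in videos for i in range(len(v) - 1)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var
    b = mean_y - a * mean_x
    return lambda frame: a * frame + b


def fit_action_extractor(
    robot_data: List[Tuple[Frame, Frame, Action]]
) -> Callable[[Frame, Frame], Action]:
    """Stage 2 (small labeled set): least-squares fit of
    action = k * (next_frame - frame), a hypothetical inverse-dynamics model."""
    num = sum(act * (nxt - cur) for cur, nxt, act in robot_data)
    den = sum((nxt - cur) ** 2 for cur, nxt, _ in robot_data)
    k = num / den
    return lambda cur, nxt: k * (nxt - cur)


# Plentiful "web video": an object drifting toward position 10 (no action labels).
videos = [[0.0, 5.0, 7.5, 8.75], [20.0, 15.0, 12.5]]
# Scarce robot data: (frame, next_frame, action that caused the change).
robot_data = [(0.0, 2.0, 1.0), (3.0, 2.0, -0.5), (5.0, 9.0, 2.0)]

predict_next = fit_video_predictor(videos)
extract_action = fit_action_extractor(robot_data)


def policy(frame: Frame) -> Action:
    """Imagine the next frame, then ask which action would produce it."""
    return extract_action(frame, predict_next(frame))
```

The point of the split in this sketch is that `fit_video_predictor` never sees an action, so it can consume unlabeled video, while only `fit_action_extractor` depends on robot-specific data.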
## Highlights

- Main idea: Robotics intelligence is shifting from text-based models to natively vision-driven models trained on web-scale video
- Technical breakthrough: Decoupling video prediction from action extraction allows models to learn world dynamics from video alone, requiring only 10-20 hours of robot-specific data for fine-tuning
- Practical takeaway: The primary challenge for deployment is achieving 99.9% reliability and integrating policy models into complex industrial environments
- Failure mode: Scaling video models may hit a plateau where adding parameters no longer yields significant quality improvements, unlike the scaling behavior seen in LLMs
- Future outlook: While dexterity remains a significant hurdle, the industry is moving toward 'Robot as a Service' models for repetitive human tasks like box folding and decanting

## Topics

Robotics, Foundation Models, Computer Vision, World Models, Autonomous Systems, Machine Learning, Robot as a Service, Video Prediction

## Chapters

- 1:00 — Defining the Robotics Intelligence Layer: Clarifying that the focus is on the intelligence layer for mobile robots and humanoids rather than just robotic arms.
- 3:20 — The Ambiguity of World Models: Discussing the varying definitions of 'world models' across the research community and how Rhoda AI fits in.
- 5:30 — Video Prediction as a Policy Model: Explaining how predicting the next frame in a video can serve as a foundation for robotic policy and action.
- 7:50 — The Quest for 99.9% Reliability: Addressing the massive gap between current capabilities and the industrial standard for autonomous reliability.
- 10:10 — Leveraging Multimodal Post-Training: How adding state and action data to vision models during post-training enables effective task execution.
- 14:40 — Data Quality and Deepfakes: How the team filters web-scale video data and uses AI detection to ensure high-quality training sets.
- 21:30 — Scaling Limits in Video Models: Preliminary findings on whether video models benefit from scaling in the same way large language models do.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/from-web-video-to-real-world-robots/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/from-web-video-to-real-world-robots.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
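
For agents that want to call the two endpoints listed under Actions, here is a minimal sketch using Python's standard `urllib.request`. The URLs and HTTP methods come from this page. The empty POST body, the absence of authentication headers, and the helper names (`build_request_transcript`, `build_read_markdown`, `send`) are assumptions, since this page does not document a request body, credentials, or response format.

```python
import urllib.request

EPISODE = "the-data-exchange-with-ben-lorica/from-web-video-to-real-world-robots"
MARKDOWN_URL = f"https://stenobird.com/podcast/{EPISODE}.md"
TRANSCRIPT_REQUEST_URL = (
    "https://stenobird.com/v1/public/podcasts/"
    "the-data-exchange-with-ben-lorica/episodes/"
    "from-web-video-to-real-world-robots/transcription-requests"
)


def build_request_transcript() -> urllib.request.Request:
    """Build the request_transcript call.
    Assumption: an empty body is accepted; the page does not specify one."""
    return urllib.request.Request(TRANSCRIPT_REQUEST_URL, data=b"", method="POST")


def build_read_markdown() -> urllib.request.Request:
    """Build the read_markdown call for the Markdown version of this episode."""
    return urllib.request.Request(MARKDOWN_URL, method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple:
    """Send a prepared request and return (status code, decoded body).
    Not called here, so importing this sketch never touches the network."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Because the page describes `request_transcript` as idempotent, repeating the POST for the same episode should be safe; reading the Markdown alone does not trigger transcription.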