# Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/video2gui-synthesizing-large-scale-interaction-trajectories-for-generalized-gui-agent-pretraining
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/video2gui-synthesizing-large-scale-interaction-trajectories-for-generalized-gui-agent-pretraining.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-22T04:02:35+00:00
Episode link: https://share.transistor.fm/s/8e5e5cca
Audio file: https://media.transistor.fm/8e5e5cca/be797ae6.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/video2gui-synthesizing-large-scale-interaction-trajectories-for-generalized-gui-agent-pretraining
Duration seconds: 1171

## Resource

🤗 Upvotes: 121 | cs.CL, cs.AI, cs.CV, cs.LG
Authors: Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian
Title: Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
Arxiv: http://arxiv.org/abs/2605.14747v1

Abstract: Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/video2gui-synthesizing-large-scale-interaction-trajectories-for-generalized-gui-agent-pretraining/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/video2gui-synthesizing-large-scale-interaction-trajectories-for-generalized-gui-agent-pretraining.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
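The two actions above can be sketched in Python with only the standard library. This is a minimal illustration, not a documented client: the helper names `build_request_transcript` and `build_read_markdown` are hypothetical, the empty POST body is an assumption (this page does not describe a request payload), and the response format is not specified here, so the sketch only constructs the requests and does not parse any reply.

```python
from urllib.request import Request

# Values copied from the URLs listed on this page.
BASE = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = (
    "video2gui-synthesizing-large-scale-interaction-trajectories"
    "-for-generalized-gui-agent-pretraining"
)


def build_request_transcript() -> Request:
    """Build the POST for the request_transcript action (hypothetical helper).

    Assumption: the endpoint accepts an empty body; the page does not
    document any payload or authentication.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, data=b"", method="POST")


def build_read_markdown() -> Request:
    """Build the GET for the read_markdown action (hypothetical helper)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


# To actually send a request, pass it to urllib.request.urlopen(), e.g.:
#   from urllib.request import urlopen
#   with urlopen(build_read_markdown()) as resp:
#       markdown_text = resp.read().decode("utf-8")
```

Because the page describes `request_transcript` as idempotent, repeating the POST should be safe, but an agent would presumably still send it only when it needs the episode processed, since viewing the page alone does not enqueue transcription.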

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.