Episode
Why Video Agent models are next — Ethan He, xAI Grok Imagine
- Published
- Jun 1, 2026
- Duration seconds
- 6206
- Processing state
- not_requested
- Canonical source
- https://www.latent.space/p/video-agents
Actions
POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/why-video-agent-models-are-next-ethan-he-xai-grok-imagine/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/latent-space-ai-engineer/why-video-agent-models-are-next-ethan-he-xai-grok-imagine.md
Read the agent-friendly Markdown representation of this episode resource.
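As a rough illustration of the two actions above, the following Python sketch builds (but does not send) the corresponding HTTP requests using only the standard library. The helper names `transcription_request` and `markdown_request` are hypothetical, and the sketch assumes the POST needs no request body or authentication, which this page does not state.

```python
from urllib.request import Request

# Values copied from the URLs listed under Actions.
BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = "why-video-agent-models-are-next-ethan-he-xai-grok-imagine"


def transcription_request(podcast: str, episode: str) -> Request:
    """Build the POST that asks for low-priority transcript generation.

    Hypothetical helper: assumes no body or auth header is required.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{podcast}"
        f"/episodes/{episode}/transcription-requests"
    )
    return Request(url, method="POST")


def markdown_request(podcast: str, episode: str) -> Request:
    """Build the GET for the Markdown representation of the episode."""
    url = f"{BASE}/podcast/{podcast}/{episode}.md"
    return Request(url, method="GET")


post_req = transcription_request(PODCAST, EPISODE)
get_req = markdown_request(PODCAST, EPISODE)
print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

Passing either object to `urllib.request.urlopen` would actually send it. Because the POST is described as idempotent, repeating it should not queue duplicate transcription work, though the response format is not documented here.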
Summary
We’re announcing AIEWF speakers this week! Take the AI Engineering Survey!

Today’s guest Ethan first joined us for the LS Paper Club as the lead on NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months. He comes back on Latent Space with some nuclear hot takes: that Video Models primarily get their intelligence from LLMs, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon world models is to work on LLMs (perhaps Interaction Models as well…).

Put it this way: in the near term, the next Sora won’t be a better video model, but a video agent. Generative Media may more closely follow the evolution of AI coding, which went from focusing on one-shot output performance and cost to multiturn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs. At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models. Now, as the performance of video models increases significantly across realism, consistency, and prompt adherence while becoming more cost efficient, the next evolution of video generation may also be systems that can plan, generate, edit, critique, and iterate across an entire creative task.

In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal mo…