# KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/kvpo-ode-native-grpo-for-autoregressive-video-alignment-via-kv-semantic-exploration
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/kvpo-ode-native-grpo-for-autoregressive-video-alignment-via-kv-semantic-exploration.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-20T04:12:27+00:00
Episode link: https://share.transistor.fm/s/14186699
Audio file: https://media.transistor.fm/14186699/f3e59734.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/kvpo-ode-native-grpo-for-autoregressive-video-alignment-via-kv-semantic-exploration
Duration seconds: 1420

## Resource

🤗 Upvotes: 36 | cs.CV
Authors: Ruicheng Zhang, Kaixi Cong, Jun Zhou, Zhizhou Zhong, Zunnan Xu, Shuiyang Mao, Wei Liu, Xiu Li
Title: KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
Arxiv: http://arxiv.org/abs/2605.14278v1

Abstract: Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
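The abstract builds on GRPO, which scores each sampled branch relative to the other branches in its group. The sketch below shows only that generic group-relative normalization step, as it is commonly described for GRPO. It does not reproduce the paper's KV routing or its TVE-based objective, which the abstract does not specify in enough detail. The function name, the `eps` constant, and the reward values are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each branch reward against its own group.

    Generic GRPO-style step: subtract the group mean and divide by the
    group standard deviation. `eps` (an assumed constant) avoids division
    by zero when every branch receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four branches generated from one prompt.
example_rewards = [0.2, 0.5, 0.9, 0.4]
example_advantages = group_relative_advantages(example_rewards)
```

Branches that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.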

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/kvpo-ode-native-grpo-for-autoregressive-video-alignment-via-kv-semantic-exploration/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/kvpo-ode-native-grpo-for-autoregressive-video-alignment-via-kv-semantic-exploration.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
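As a minimal sketch, the two actions above could be prepared from Python with the standard library. The URLs and HTTP methods are taken from the list. Everything else is an assumption: the helper names are hypothetical, the POST is sent with an empty body, and no authentication headers are set, since this page does not document the request body, headers, or response format.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = (
    "kvpo-ode-native-grpo-for-autoregressive-video-alignment"
    "-via-kv-semantic-exploration"
)


def build_request_transcript():
    """Prepare the request_transcript action (POST, empty body assumed)."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return urllib.request.Request(url, data=b"", method="POST")


def build_read_markdown():
    """Prepare the read_markdown action (GET of the .md representation)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(request, timeout=10):
    """Send a prepared request and return (status, body text).

    Assumes a UTF-8 response; the actual response format is not documented here.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Building the request separately from sending it lets an agent inspect the method and URL first. Because the page describes `request_transcript` as idempotent, repeating the POST should not queue duplicate work, though the response to a repeated call is not described here.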

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.