Episode
Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization
- Podcast
- Daily Paper Cast
- Published
- Jun 11, 2026
- Duration seconds
- 1468
- Processing state
- not_requested
- Canonical source
- https://share.transistor.fm/s/39e08487
Actions
POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/lip-forcing-few-step-autoregressive-diffusion-for-real-time-lip-synchronization/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/daily-paper-cast-7079649/lip-forcing-few-step-autoregressive-diffusion-for-real-time-lip-synchronization.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
🤗 Upvotes: 29 | cs.CV
Authors: Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim
Title: Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization
Arxiv: http://arxiv.org/abs/2606.11180v1
Abstract: Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, $17.6\times$ faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs $39.8\times$ faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
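The abstract's cost argument can be illustrated with a small sketch. It is not the authors' code. It shows the generic classifier-free guidance (CFG) combination, counts network evaluations per chunk with and without CFG, and derives the baseline throughput implied by the reported 31 FPS and 17.6x figures. The function names are hypothetical, the single-condition CFG form is the textbook one rather than anything the paper confirms, and the throughput calculation assumes the 17.6x speedup is a ratio of frames per second.

```python
from typing import List


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Textbook CFG: move from the unconditional prediction toward the
    conditional one by `scale` (scale = 1.0 returns the conditional output).
    Assumption: a single conditioning signal; the paper's exact form may differ."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes_per_chunk(steps: int, use_cfg: bool) -> int:
    """CFG needs a conditional and an unconditional evaluation at every
    denoising step, so it doubles the number of network calls."""
    return steps * (2 if use_cfg else 1)


def implied_baseline_fps(student_fps: float, speedup: float) -> float:
    """Baseline throughput, assuming `speedup` is an FPS ratio."""
    return student_fps / speedup


if __name__ == "__main__":
    # Two denoising steps without inference-time CFG, as stated in the abstract.
    print(forward_passes_per_chunk(steps=2, use_cfg=False))
    # The same two steps with CFG would need twice as many evaluations.
    print(forward_passes_per_chunk(steps=2, use_cfg=True))
    # 31 FPS at 17.6x implies a same-scale bidirectional baseline of about 1.76 FPS.
    print(round(implied_baseline_fps(31.0, 17.6), 2))
```

Under these assumptions, dropping CFG halves the network calls per denoising step, which compounds with the reduction to two steps per chunk.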