Episode

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Podcast
Daily Paper Cast
Published
May 20, 2026
Duration seconds
1345
Processing state
not_requested
Canonical source
https://share.transistor.fm/s/ef0f6ffc
Audio
https://media.transistor.fm/ef0f6ffc/220256a2.mp3
JSON
/v1/public/podcasts/daily-paper-cast-7079649/episodes/longlive-2-0-an-nvfp4-parallel-infrastructure-for-long-video-generation
Markdown
/podcast/daily-paper-cast-7079649/longlive-2-0-an-nvfp4-parallel-infrastructure-for-long-video-generation.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/longlive-2-0-an-nvfp4-parallel-infrastructure-for-long-video-generation/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/daily-paper-cast-7079649/longlive-2-0-an-nvfp4-parallel-infrastructure-for-long-video-generation.md
    Read the agent-friendly Markdown representation of this episode resource.
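
The two actions above can be sketched as plain HTTP requests. This is a minimal sketch using only Python's standard library. The helper names `transcription_request` and `markdown_request` are hypothetical, and it assumes the POST needs no request body or authentication header, which this page does not specify.

```python
from urllib.request import Request

BASE_URL = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = "longlive-2-0-an-nvfp4-parallel-infrastructure-for-long-video-generation"


def transcription_request(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build (but do not send) the POST that asks for a transcript."""
    url = (
        f"{BASE_URL}/v1/public/podcasts/{podcast}"
        f"/episodes/{episode}/transcription-requests"
    )
    # Assumption: no body and no auth header are required; the page does not say.
    return Request(url, method="POST")


def markdown_request(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build (but do not send) the GET for the Markdown representation."""
    return Request(f"{BASE_URL}/podcast/{podcast}/{episode}.md", method="GET")
```

Passing either object to `urllib.request.urlopen` would send it. Because the page describes the POST as idempotent, repeating it for the same episode should be safe.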

Summary

🤗 Upvotes: 92 | cs.CV, cs.DC
Authors: Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han
Title: LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
Arxiv: http://arxiv.org/abs/2605.18739v2
Abstract: We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while…
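
The abstract leans on NVFP4 for weights, activations, and the KV cache but does not describe the quantizer. As a rough illustration of what block-scaled 4-bit storage does to values, the sketch below simulates it in pure Python. It assumes the commonly documented NVFP4 layout: 4-bit E2M1 elements with one shared scale per 16-element block. The function names are hypothetical, the scale is kept as an ordinary float rather than FP8, and none of this is taken from the LongLive-2.0 code.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size: one shared scale per 16 elements


def fake_quantize_block(block):
    """Quantize one block onto the E2M1 grid with a shared scale, then dequantize."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Simplification: the scale stays a Python float; real NVFP4 stores it in FP8.
    scale = amax / E2M1_VALUES[-1]
    out = []
    for v in block:
        # Snap the scaled magnitude to the nearest representable value.
        q = min(E2M1_VALUES, key=lambda c: abs(abs(v) / scale - c))
        out.append((q if v >= 0 else -q) * scale)
    return out


def fake_quantize(values):
    """Apply the block quantizer to a flat list, one 16-element block at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

Calling `fake_quantize` on a list returns values of the same length, each snapped to one of eight magnitudes per block. That coarse grid is the rounding error a real W4A4 or KV-cache quantizer has to tolerate in exchange for the memory savings.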