# Kwai Keye-VL-2.0 Technical Report

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/kwai-keye-vl-2-0-technical-report
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/kwai-keye-vl-2-0-technical-report.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-06-11T04:29:30+00:00
Episode link: https://share.transistor.fm/s/3be59bc5
Audio file: https://media.transistor.fm/3be59bc5/0b0d00a2.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/kwai-keye-vl-2-0-technical-report
Duration seconds: 1532

## Resource

🤗 Upvotes: 171 | cs.CV Authors: Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang Title: Kwai Keye-VL-2.0 Technical Report Arxiv: http://arxiv.org/abs/2606.10651v1 Abstract: We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level t…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/kwai-keye-vl-2-0-technical-report/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/kwai-keye-vl-2-0-technical-report.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.