Episode

Qwen-Image-VAE-2.0 Technical Report

Podcast: Daily Paper Cast
Published: May 15, 2026
Duration seconds: 1462
Processing state: not_requested
Canonical source: https://share.transistor.fm/s/12f43ae7
Audio: https://media.transistor.fm/12f43ae7/11834c58.mp3
JSON: /v1/public/podcasts/daily-paper-cast-7079649/episodes/qwen-image-vae-2-0-technical-report
Markdown: /podcast/daily-paper-cast-7079649/qwen-image-vae-2-0-technical-report.md

Actions

POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/qwen-image-vae-2-0-technical-report/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/daily-paper-cast-7079649/qwen-image-vae-2-0-technical-report.md
Read the agent-friendly Markdown representation of this episode resource.
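
The two actions above could be called from a script roughly as follows. This is a minimal sketch using only the Python standard library. The URLs are the ones listed above; everything else is an assumption, since this page does not document it: the helper names are hypothetical, the POST is assumed to take an empty body and no authentication, and the response format is treated as opaque text.

```python
import urllib.request

BASE = "https://stenobird.com"
EPISODE_PATH = (
    "/v1/public/podcasts/daily-paper-cast-7079649"
    "/episodes/qwen-image-vae-2-0-technical-report"
)
MARKDOWN_PATH = (
    "/podcast/daily-paper-cast-7079649"
    "/qwen-image-vae-2-0-technical-report.md"
)


def build_transcription_request() -> urllib.request.Request:
    """Build the POST that requests transcript generation.

    Assumption: the endpoint accepts an empty body. The page describes the
    action as idempotent, so repeating the call should be harmless.
    """
    url = BASE + EPISODE_PATH + "/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Build the GET for the Markdown representation of the episode."""
    return urllib.request.Request(BASE + MARKDOWN_PATH, method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a request and return (status code, body decoded as text).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Network calls happen only when the file is run as a script.
    status, _ = send(build_transcription_request())
    print("transcription request status:", status)
    status, body = send(build_markdown_request())
    print("markdown status:", status, "length:", len(body))
```

Keeping request construction separate from sending makes it possible to inspect the method and URL without touching the network.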

Summary

🤗 Upvotes: 42 | cs.CV
Authors: Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Yuxiang Chen, Zhendong Wang, Zihao Liu, Zikai Zhou, Yiliang Gu, Yi Wang, Xiaoxiao Xu, Lin Qu
Title: Qwen-Image-VAE-2.0 Technical Report
Arxiv: http://arxiv.org/abs/2605.13565v1
Abstract: We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT…