Episode

Flow-OPD: On-Policy Distillation for Flow Matching Models

Podcast: Daily Paper Cast
Published: May 12, 2026
Duration seconds: 1575
Processing state: not_requested
Canonical source: https://share.transistor.fm/s/56fff24a
Audio: https://media.transistor.fm/56fff24a/67b79d84.mp3
JSON: /v1/public/podcasts/daily-paper-cast-7079649/episodes/flow-opd-on-policy-distillation-for-flow-matching-models
Markdown: /podcast/daily-paper-cast-7079649/flow-opd-on-policy-distillation-for-flow-matching-models.md

Actions

POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/flow-opd-on-policy-distillation-for-flow-matching-models/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/daily-paper-cast-7079649/flow-opd-on-policy-distillation-for-flow-matching-models.md
Read the agent-friendly Markdown representation of this episode resource.
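
As a rough illustration of the two actions above, the following Python sketch builds (but does not send) the corresponding HTTP requests with the standard library. The URLs come from this page; the empty POST body and the absence of authentication headers are assumptions, since the page does not state what the endpoint expects. The function names are hypothetical.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
EPISODE_API_PATH = (
    "/v1/public/podcasts/daily-paper-cast-7079649/episodes/"
    "flow-opd-on-policy-distillation-for-flow-matching-models"
)
EPISODE_MD_PATH = (
    "/podcast/daily-paper-cast-7079649/"
    "flow-opd-on-policy-distillation-for-flow-matching-models.md"
)


def transcription_request() -> Request:
    """Build the POST that asks for low-priority transcript generation.

    Assumption: an empty body and no auth headers are sufficient.
    """
    url = BASE + EPISODE_API_PATH + "/transcription-requests"
    return Request(url, data=b"", method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation of this episode."""
    return Request(BASE + EPISODE_MD_PATH, method="GET")
```

Passing either object to `urllib.request.urlopen` would send it. Because the POST is described as idempotent, repeating it for the same episode should not queue duplicate work, although the response format is not documented here.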

Summary

🤗 Upvotes: 79 | cs.CV, cs.AI
Authors: Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao
Title: Flow-OPD: On-Policy Distillation for Flow Matching Models
Arxiv: http://arxiv.org/abs/2605.08063v1
Abstract: Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 po…
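
The three-step orchestration named in the abstract (on-policy sampling, task-routing labeling, dense trajectory-level supervision) can be pictured with a toy NumPy sketch. This is not the paper's implementation: the linear velocity fields, the Euler sampler, the mean-squared-error loss, and the task names `"ocr"` and `"geneval"` are all assumptions chosen only to make the idea concrete.

```python
import numpy as np


def make_velocity(weight):
    """Toy velocity field v(x, t) = weight * x (hypothetical stand-in for a network)."""
    return lambda x, t: weight * x


def sample_on_policy(student_v, x0, num_steps):
    """Euler-integrate the student's own velocity field and record each visited (x, t)."""
    dt = 1.0 / num_steps
    x, states = x0, []
    for i in range(num_steps):
        t = i * dt
        states.append((x, t))
        x = x + dt * student_v(x, t)
    return states


def distill_loss(student_v, teachers, task, states):
    """Dense supervision: compare student and routed teacher at every visited state."""
    teacher_v = teachers[task]  # task routing: pick the expert for this prompt's task
    per_step = [np.mean((student_v(x, t) - teacher_v(x, t)) ** 2) for x, t in states]
    return float(np.mean(per_step))


# Hypothetical experts; in the abstract these would be single-reward fine-tuned teachers.
teachers = {"ocr": make_velocity(1.0), "geneval": make_velocity(-1.0)}
student = make_velocity(1.0)
states = sample_on_policy(student, np.ones(4), num_steps=8)
```

The point of the sketch is that every step of the student's own trajectory receives a teacher signal, in contrast to a single scalar reward assigned to the final sample.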