# Flow-OPD: On-Policy Distillation for Flow Matching Models Page: https://stenobird.com/podcast/daily-paper-cast-7079649/flow-opd-on-policy-distillation-for-flow-matching-models Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/flow-opd-on-policy-distillation-for-flow-matching-models.md Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649) Published: 2026-05-12T04:03:07+00:00 Episode link: https://share.transistor.fm/s/56fff24a Audio file: https://media.transistor.fm/56fff24a/67b79d84.mp3 Processing state: not_requested JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/flow-opd-on-policy-distillation-for-flow-matching-models Duration seconds: 1575 ## Resource 🤗 Upvotes: 79 | cs.CV, cs.AI Authors: Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao Title: Flow-OPD: On-Policy Distillation for Flow Matching Models Arxiv: http://arxiv.org/abs/2605.08063v1 Abstract: Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 po… ## Actions - request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/flow-opd-on-policy-distillation-for-flow-matching-models/transcription-requests` — Idempotently request low-priority transcript generation for this episode. - read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/flow-opd-on-policy-distillation-for-flow-matching-models.md` — Read the agent-friendly Markdown representation of this episode resource. A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed. ## Transcript Full transcripts are not published on public pages unless there is a clear rights basis.