# RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/rubricem-meta-rl-with-rubric-guided-policy-decomposition-beyond-verifiable-rewards
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/rubricem-meta-rl-with-rubric-guided-policy-decomposition-beyond-verifiable-rewards.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-14T04:32:56+00:00
Episode link: https://share.transistor.fm/s/99f378e6
Audio file: https://media.transistor.fm/99f378e6/560b026b.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/rubricem-meta-rl-with-rubric-guided-policy-decomposition-beyond-verifiable-rewards
Duration seconds: 1357

## Resource

🤗 Upvotes: 66 | cs.CL, cs.LG

Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister

Title: RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Arxiv: http://arxiv.org/abs/2605.10899v1

Abstract: Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough ana…
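The abstract says Stage-Structured GRPO uses stagewise rubric judgments to give denser feedback, but this page does not give the formula. The sketch below is only one plausible reading, not the paper's method: it applies the standard GRPO group-relative normalization (score minus group mean, divided by group standard deviation) separately to each stage. The stage keys, the function name `stagewise_advantages`, the `eps` term, and the use of a population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical stage keys, taken from the four stages named in the abstract.
STAGES = ("planning", "evidence_gathering", "review", "synthesis")


def stagewise_advantages(group_scores, eps=1e-8):
    """Normalize rubric scores within a sampled group, one stage at a time.

    group_scores: list of dicts mapping stage name -> rubric score,
                  one dict per trajectory sampled for the same prompt.
    Returns a list of dicts mapping stage name -> group-relative advantage.
    """
    advantages = [{} for _ in group_scores]
    for stage in STAGES:
        scores = [traj[stage] for traj in group_scores]
        mu = mean(scores)
        sigma = pstdev(scores)  # assumption: population std over the group
        for adv, score in zip(advantages, scores):
            # eps keeps the division defined when every score is identical
            adv[stage] = (score - mu) / (sigma + eps)
    return advantages


# Illustrative scores only; these are not results from the paper.
example_group = [
    {"planning": 1.0, "evidence_gathering": 0.5, "review": 0.7, "synthesis": 0.9},
    {"planning": 0.0, "evidence_gathering": 0.5, "review": 0.3, "synthesis": 0.1},
]
example_advantages = stagewise_advantages(example_group)
```

Under this reading, a trajectory that plans well but synthesizes poorly receives a positive signal for its planning tokens and a negative one for its synthesis tokens, rather than one scalar for the whole rollout. A stage on which all sampled trajectories score the same contributes an advantage of zero.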

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/rubricem-meta-rl-with-rubric-guided-policy-decomposition-beyond-verifiable-rewards/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/rubricem-meta-rl-with-rubric-guided-policy-decomposition-beyond-verifiable-rewards.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
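The two actions above can be sketched with Python's standard library. This is a minimal illustration, not a documented client: the helper names are hypothetical, and sending an empty POST body with no authentication headers is an assumption, since this page does not describe a request payload or response schema.

```python
import urllib.request

# URL components copied from the action URLs listed above.
BASE = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = (
    "rubricem-meta-rl-with-rubric-guided-policy-decomposition"
    "-beyond-verifiable-rewards"
)


def build_transcript_request() -> urllib.request.Request:
    """Build the request_transcript call (POST)."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: an empty body is accepted; no payload is documented here.
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Build the read_markdown call (GET)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `send(build_transcript_request())` would issue the explicit transcription request. Because the action is described as idempotent, repeating the call for the same episode should be safe. `send(build_markdown_request())` would return this Markdown representation.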

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.