# GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-21T04:36:22+00:00
Episode link: https://share.transistor.fm/s/dae8be06
Audio file: https://media.transistor.fm/dae8be06/8288569a.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment
Duration seconds: 1476

## Resource

🤗 Upvotes: 51 | cs.CL
Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, jian Liang, Ruiming Tang, Han Li
Title: GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
Arxiv: http://arxiv.org/abs/2605.19577v1

Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions.

(1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement.

(2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight,…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
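
The two actions above can be sketched in Python using only the standard library. This is a minimal illustration, not a documented client: the function names `build_transcript_request`, `build_markdown_request`, and `send` are hypothetical, and the sketch assumes the `POST` needs no request body or authentication headers, which this page does not state. The response format is also unspecified here, so `send` returns the raw status code and body text.

```python
import urllib.request

# URL parts copied from the Actions list above.
BASE = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = (
    "golongrl-capability-oriented-long-context-"
    "reinforcement-learning-with-multitask-alignment"
)


def build_transcript_request() -> urllib.request.Request:
    """Build the request_transcript call (POST).

    Assumption: no body or auth header is required; the page does not say.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return urllib.request.Request(url, method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Build the read_markdown call (GET)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 30.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, body text).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


# Example (performs network I/O, so it is left commented out):
# status, body = send(build_transcript_request())
# status, markdown = send(build_markdown_request())
```

Because `request_transcript` is described as idempotent, an agent that is unsure whether an earlier call succeeded can repeat it without queuing duplicate work. Building the request separately from sending it keeps the URL construction easy to inspect before any network call is made.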

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.