Episode

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Podcast
Daily Paper Cast
Published
May 21, 2026
Duration seconds
1476
Processing state
not_requested
Canonical source
https://share.transistor.fm/s/dae8be06
Audio
https://media.transistor.fm/dae8be06/8288569a.mp3
JSON
/v1/public/podcasts/daily-paper-cast-7079649/episodes/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment
Markdown
/podcast/daily-paper-cast-7079649/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/daily-paper-cast-7079649/golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment.md
    Read the agent-friendly Markdown representation of this episode resource.
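
The two actions above can be sketched with Python's standard library. This is a minimal sketch, not a documented client. The helper names (`build_transcription_request`, `build_markdown_request`, `send`) are hypothetical. The empty POST body, the absence of authentication headers, and the assumption that responses are UTF-8 text are guesses, since this page does not specify them.

```python
from urllib.request import Request, urlopen

BASE = "https://stenobird.com"
# Paths copied from the JSON and Markdown fields listed on this page.
EPISODE_PATH = (
    "/v1/public/podcasts/daily-paper-cast-7079649/episodes/"
    "golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment"
)
MARKDOWN_PATH = (
    "/podcast/daily-paper-cast-7079649/"
    "golongrl-capability-oriented-long-context-reinforcement-learning-with-multitask-alignment.md"
)


def build_transcription_request() -> Request:
    """Build the POST that requests low-priority transcript generation.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    url = BASE + EPISODE_PATH + "/transcription-requests"
    return Request(url, data=b"", method="POST")


def build_markdown_request() -> Request:
    """Build the GET for the Markdown representation of this episode."""
    return Request(BASE + MARKDOWN_PATH, method="GET")


def send(req: Request, timeout: float = 30.0) -> tuple[int, str]:
    """Send a request and return (status code, body decoded as text).

    Assumption: the response body is UTF-8 text; undecodable bytes are replaced.
    """
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The page describes the POST as idempotent, so repeating it should be safe.
    status, _ = send(build_transcription_request())
    print("transcription request status:", status)
    status, markdown = send(build_markdown_request())
    print("markdown status:", status, "length:", len(markdown))
```

Building the requests separately from sending them keeps the URLs inspectable without touching the network. `urlopen` raises `urllib.error.HTTPError` on 4xx and 5xx responses, so a real caller would likely wrap `send` in error handling.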

Summary

🤗 Upvotes: 51 | cs.CL
Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li
Title: GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
Arxiv: http://arxiv.org/abs/2605.19577v1
Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight,…