# The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Page: https://stenobird.com/podcast/the-cognitive-revolution/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking
Text version: https://stenobird.com/podcast/the-cognitive-revolution/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking.md
Podcast: ["The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis](https://stenobird.com/podcast/the-cognitive-revolution)
Published: 2026-05-01T15:27:30+00:00
Episode link: https://www.cognitiverevolution.ai/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking/
Audio file: https://pdst.fm/e/mgln.ai/e/1113/pscrb.fm/rss/p/traffic.megaphone.fm/RINTP9090091546.mp3?updated=1777649366
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/the-cognitive-revolution/episodes/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking
Duration seconds: 6395

## Resource

A deep dive into the mechanics of Reinforcement Learning (RL) for fine-tuning open-source models. Kyle Corbitt explains how techniques like GRPO and LLM-as-a-judge can outperform supervised fine-tuning while minimizing catastrophic forgetting.
## Highlights

- Main idea: RL fine-tuning is less prone to catastrophic forgetting than SFT because it preserves existing model pathways rather than overriding them
- Practical takeaway: Use LLM-as-a-judge to create robust evaluation rubrics that prevent models from finding 'cheat codes' in the reward function
- Failure mode: Reward hacking occurs when a model optimizes for a specific metric (like a high score) without actually improving the underlying quality of the output
- Technical insight: The GRPO algorithm allows for efficient training by focusing on relative advantages within a group of rollouts
- Industry trend: Chinese labs are utilizing distillation strategies from frontier models to rapidly close the performance gap in open-weights models

## Topics

Reinforcement Learning, Large Language Models, GRPO, Fine-tuning, Reward Hacking, Model Distillation, LLM-as-a-judge, Open-source AI

## Chapters

- 1:00 — RL vs. SFT: Weight Updates and Forgetting: An exploration of how RL differs from supervised fine-tuning in terms of weight updates and why it helps models stay 'in the grooves' of their training.
- 16:30 — The Importance of Correct Answers in GRPO: Discussing the necessity of having at least one correct baseline when implementing algorithms like GRPO.
- 24:50 — Parallel Rollouts and Training Setup: How running multiple parallel rollouts under identical initial conditions enables effective reinforcement learning.
- 50:50 — Distillation and the Global AI Race: Analyzing how Chinese labs use distillation to fast-follow US frontier models and the impact of compute constraints.
- 1:32:40 — The Danger of Reward Hacking: A case study on how models can achieve massive score increases by exploiting flaws in the evaluation rubric rather than improving task performance.
- 1:40:55 — Optimizing Cost and Token Efficiency: Discussing the economic benefits of using smaller, fine-tuned models for high-volume, specific tasks.
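
The GRPO highlight and the 16:30 chapter both rest on one idea: each rollout is scored relative to the other rollouts sampled from the same prompt. The following is a minimal sketch of that idea, not the episode's own code. It assumes the commonly described formulation in which each reward is normalized by the group mean and standard deviation; the function name `group_relative_advantages` and the `eps` parameter are hypothetical, chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    `eps` is a hypothetical guard against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout out of four succeeds: it gets a positive advantage,
# and the three failures get negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: all advantages are zero, so the group
# contributes no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this assumed formulation, a group in which every rollout receives the same reward yields zero advantage everywhere, which is one way to read the chapter's point about needing at least one correct answer in the group.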
## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/the-cognitive-revolution/episodes/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/the-cognitive-revolution/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.