{"podcast":{"title":"\"The Cognitive Revolution\" | AI Builders, Researchers, and Live Player Analysis","slug":"the-cognitive-revolution","podcast_index_feed_id":6011783,"rss_url":"https://feeds.megaphone.fm/RINTP3108857801","website_url":"https://www.cognitiverevolution.ai/","image_url":"https://megaphone.imgix.net/podcasts/30f818da-c930-11ed-9b4b-1352ca96fb17/image/888e2c534b7c2534213c97e025646932.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress","author":"Turpentine","episode_count":346,"summary":"A biweekly podcast where hosts Nathan Labenz and Erik Torenberg interview the builders on the edge of AI and explore the dramatic shift it will unlock in the coming years. The Cognitive Revolution is part of the Turpentine podcast network. To learn more: turpentine.co","last_synced_at":null,"page_url":"https://stenobird.com/podcast/the-cognitive-revolution"},"episode":{"title":"The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking","slug":"the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking","published_at":"2026-05-01T15:27:30+00:00","page_url":"https://stenobird.com/podcast/the-cognitive-revolution/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking","show_page_url":"https://stenobird.com/podcast/the-cognitive-revolution","url":"https://www.cognitiverevolution.ai/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking/","audio_url":"https://pdst.fm/e/mgln.ai/e/1113/pscrb.fm/rss/p/traffic.megaphone.fm/RINTP9090091546.mp3?updated=1777649366","summary":"A deep dive into the mechanics of Reinforcement Learning (RL) for fine-tuning open-source models. Kyle Corbitt explains how techniques like GRPO and LLM-as-a-judge can outperform supervised fine-tuning while minimizing catastrophic forgetting.","meta_description":"Master the RL fine-tuning playbook: Learn about GRPO, reward hacking, and using LLMs as judges to optimize open-source models with Kyle Corbitt.","key_points":["Main idea: RL fine-tuning is less prone to catastrophic forgetting than SFT because it preserves existing model pathways rather than overriding them","Practical takeaway: Use LLM-as-a-judge to create robust evaluation rubrics that prevent models from finding 'cheat codes' in the reward function","Failure mode: Reward hacking occurs when a model optimizes for a specific metric (like a high score) without actually improving the underlying quality of the output","Technical insight: The GRPO algorithm allows for efficient training by focusing on relative advantages within a group of rollouts","Industry trend: Chinese labs are utilizing distillation strategies from frontier models to rapidly close the performance gap in open-weights models"],"chapters":[{"start_ms":60000,"title":"RL vs. SFT: Weight Updates and Forgetting","summary":"An exploration of how RL differs from supervised fine-tuning in terms of weight updates and why it helps models stay 'in the grooves' of their training."},{"start_ms":990000,"title":"The Importance of Correct Answers in GRPO","summary":"Discussing the necessity of having at least one correct baseline when implementing algorithms like GRPO."},{"start_ms":1490000,"title":"Parallel Rollouts and Training Setup","summary":"How running multiple parallel rollouts under identical initial conditions enables effective reinforcement learning."},{"start_ms":3050000,"title":"Distillation and the Global AI Race","summary":"Analyzing how Chinese labs use distillation to fast-follow US frontier models and the impact of compute constraints."},{"start_ms":5560000,"title":"The Danger of Reward Hacking","summary":"A case study on how models can achieve massive score increases by exploiting flaws in the evaluation rubric rather than improving task performance."},{"start_ms":6055000,"title":"Optimizing Cost and Token Efficiency","summary":"Discussing the economic benefits of using smaller, fine-tuned models for high-volume, specific tasks."}],"topics":["Reinforcement Learning","Large Language Models","GRPO","Fine-tuning","Reward Hacking","Model Distillation","LLM-as-a-judge","Open-source AI"],"duration_seconds":6395,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/the-cognitive-revolution/episodes/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/the-cognitive-revolution/the-rl-fine-tuning-playbook-coreweave-s-kyle-corbitt-on-grpo-rubrics-environments-reward-hacking.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}