# The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)

- Page: https://stenobird.com/podcast/latent-space-ai-engineer/the-rlvr-revolution-with-nathan-lambert-ai2-interconnects-ai
- Text version: https://stenobird.com/podcast/latent-space-ai-engineer/the-rlvr-revolution-with-nathan-lambert-ai2-interconnects-ai.md
- Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
- Published: 2025-07-31T15:00:00+00:00
- Episode link: https://www.latent.space/p/the-rlvr-revolution-with-nathan-lambert
- Audio file: https://api.substack.com/feed/podcast/186621771/524e0bea632d56947fcb7db8fc4c2238.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-rlvr-revolution-with-nathan-lambert-ai2-interconnects-ai
- Duration seconds: 4739

## Resource

We first had Nathan on to give us his RLHF deep dive when he was joining AI2, and now he’s back to help us catch up on the evolution to RLVR (Reinforcement Learning with Verifiable Rewards), first proposed in his Tulu 3 paper.

While RLHF remains foundational, RLVR has emerged as a powerful approach for training models on tasks with clear success criteria, using verifiable, objective functions as reward signals. It is particularly useful in domains like math, code correctness, and instruction-following. Instead of relying solely on subjective human feedback, RLVR leverages deterministic signals to guide optimization, making it more scalable and potentially more reliable across many domains. However, Lambert notes that RLVR is still rapidly evolving, especially regarding how it handles tool use and multi-step reasoning.

We also discussed the Tulu model series, a family of instruction-tuned open models developed at AI2. Tulu is designed to be a reproducible, state-of-the-art post-training recipe for the open community.
Unlike frontier labs like OpenAI or Anthropic, which rely on vast and often proprietary datasets, Tulu aims to distill and democratize best practices for instruction and preference tuning. We are impressed with how small eval suites, careful task selection, and transparent methodology can rival even the best proprietary models on specific benchmarks.

One of the most fascinating threads is the challenge of incorporating tool use into RL frameworks. Lambert highlights that while you can prompt a model to use tools like search or code execution, getting the model to reliably learn when and how to use them through RL is much harder. This is compounded by the difficulty of designing reward functions that avoid overoptimization, where models learn to “game” the reward s…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/the-rlvr-revolution-with-nathan-lambert-ai2-interconnects-ai/transcription-requests`: Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/the-rlvr-revolution-with-nathan-lambert-ai2-interconnects-ai.md`: Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.