# [State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

- Page: https://stenobird.com/podcast/latent-space-ai-engineer/state-of-post-training-from-gpt-4-1-to-5-1-rlvr-agent-token-efficiency-josh-mcgrath-openai
- Text version: https://stenobird.com/podcast/latent-space-ai-engineer/state-of-post-training-from-gpt-4-1-to-5-1-rlvr-agent-token-efficiency-josh-mcgrath-openai.md
- Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
- Published: 2025-12-31T14:00:00+00:00
- Episode link: https://www.latent.space/p/state-of-post-training-from-gpt-41
- Audio file: https://api.substack.com/feed/podcast/186610564/4944e1f91a0d0d17e5525fb297469684.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/state-of-post-training-from-gpt-4-1-to-5-1-rlvr-agent-token-efficiency-josh-mcgrath-openai
- Duration seconds: 1654

## Resource

From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI’s post-training evolution, from the PPO vs DPO debates of 2023 to today’s RLVR era, where the real innovation isn’t optimization methods but data quality, signal trust, and token efficiency.

We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math), how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can’t), why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens), how Codex has changed his workflow so much he feels “trapped” by 40-minute design sessions followed by 15-minute agent sprints, the infrastructure chaos of scaling RL (“way more moving parts than pre-training”), why long context will keep climbing but agents + graph walks might matter more than 10M-token windows, the shopping model as a test bed for interruptibility and chain-of-thought transparency, why personality toggles (Anton vs Clippy) are a real differentiator users care about, and his thesis that the education system isn’t producing enough people who can do both distributed systems and ML research, the exact skill set required to push the frontier when the bottleneck moves every few weeks.

We discuss:

* Josh’s path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model
* Why he switched from pre-training to post-training: “Do I want to make 3% compute efficiency wins, or change be…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/state-of-post-training-from-gpt-4-1-to-5-1-rlvr-agent-token-efficiency-josh-mcgrath-openai/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/state-of-post-training-from-gpt-4-1-to-5-1-rlvr-agent-token-efficiency-josh-mcgrath-openai.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.