Episode

Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Podcast
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
Published
Feb 22, 2026
Duration seconds
3329
Processing state
processed
Canonical source
https://www.cognitiverevolution.ai/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post/
Audio
https://pdst.fm/e/mgln.ai/e/1113/pscrb.fm/rss/p/traffic.megaphone.fm/RINTP9245442386.mp3?updated=1771777343
JSON
/v1/public/podcasts/the-cognitive-revolution/episodes/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post
Markdown
/podcast/the-cognitive-revolution/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/the-cognitive-revolution/episodes/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/the-cognitive-revolution/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

MiniMax researcher Olive Song explains how tight feedback loops between in-house developers and researchers drive the training of the M-series frontier models. The discussion covers the engineering of reinforcement learning, including the finding that FP32 precision was needed to close the gap between the algorithm as designed and the algorithm as implemented.

Topics

  • Reinforcement Learning
  • Large Language Models
  • MiniMax
  • AI Agents
  • Model Alignment
  • FP32 Precision
  • Agentic Workflows
  • Machine Learning Engineering

Highlights

  • Main idea: MiniMax has researchers and application developers work side by side, which creates tight product feedback loops
  • Technical breakthrough: The team discovered that running reinforcement learning at FP32 precision was essential to bridge the gap between theoretical algorithms and real-world implementation
  • Failure mode: Reward hacking remains a constant battle, requiring systematic environment perturbations and robust alignment strategies to prevent models from finding shortcuts
  • Practical takeaway: Implementing 'interleaved thinking', which allows models to pause and process environmental feedback, is key to mastering long-horizon agentic tasks
  • Research approach: MiniMax uses a first-principles approach to debugging, analyzing log probabilities layer-by-layer to diagnose why accuracy fails to scale
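
The precision point in the highlights above can be illustrated with a small, self-contained sketch. This is not MiniMax's code or method: it only shows, under stated assumptions, how numerical precision alone can shift the log probabilities that reinforcement learning depends on. FP16 is used as the low-precision stand-in (the episode summary does not name the lower-precision format), the logits are synthetic, and the helper names `round_to`, `log_softmax`, and `max_gap` are hypothetical.

```python
import math
import struct


def round_to(fmt, x):
    """Round a Python float to the nearest value representable in a struct format.

    'e' is IEEE half precision (FP16), 'f' is FP32, 'd' is FP64 (no change).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def log_softmax(logits, fmt):
    """Log-softmax in which every intermediate result is rounded to `fmt`."""
    def r(v):
        return round_to(fmt, v)

    logits = [r(v) for v in logits]
    peak = max(logits)
    shifted = [r(v - peak) for v in logits]  # subtract the max for stability
    total = 0.0
    for s in shifted:
        total = r(total + r(math.exp(s)))    # accumulate at the target precision
    log_z = r(math.log(total))
    return [r(s - log_z) for s in shifted]


def max_gap(a, b):
    """Largest absolute difference between two log-probability vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Synthetic logits over an assumed 1000-token vocabulary (illustrative only).
logits = [8.0 * math.sin(i) for i in range(1000)]

reference = log_softmax(logits, "d")                   # FP64 reference
gap16 = max_gap(log_softmax(logits, "e"), reference)   # low-precision stand-in
gap32 = max_gap(log_softmax(logits, "f"), reference)

print(f"max |log-prob gap| FP16 vs FP64: {gap16:.6f}")
print(f"max |log-prob gap| FP32 vs FP64: {gap32:.6f}")
```

Comparing the two printed gaps shows the kind of discrepancy that a side-by-side check of log probabilities can surface. In a real model, the same comparison would be made between the outputs of two implementations, one layer at a time, rather than between rounded copies of one function.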

Chapters

  1. 1:00 Introduction to MiniMax and the M-series: An introduction to Olive Song and the development of the M-series models that lead the OpenRouter leaderboards.
  2. 5:20 The Developer-Researcher Feedback Loop: How having in-house developers provides precise rewards and evaluations for training foundation models.
  3. 13:20 Agent Generalization and Tool Scaling: Exploring the limits of tool scaling and the move toward more robust agentic capabilities.
  4. 17:15 The Engineering of Reinforcement Learning: A deep dive into the importance of engineering precision and the fight against reward hacking.
  5. 22:05 Debugging via Layer-by-Layer Analysis: The story of discovering implementation gaps by analyzing log probabilities at the layer level.
  6. 30:40 Alignment and Safety at Scale: How MiniMax handles large-scale alignment and safety evaluations before model launches.
  7. 35:30 Long-Horizon Agentic Tasks: Discussing the implementation of interleaved thinking for complex, multi-step tasks.
  8. 43:55 The Future of M2.2 and AGI: Looking ahead to improved multilingual coding and the ultimate goal of human-expert collaboration.