Episode
Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post
- Published: Feb 22, 2026
- Duration seconds: 3329
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/the-cognitive-revolution/episodes/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/the-cognitive-revolution/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post.md
Read the agent-friendly Markdown representation of this episode resource.
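As a rough illustration of how a client might prepare the two actions above, the following sketch builds the requests with Python's standard library. Only the two URLs and their HTTP methods come from this page; the helper name `send`, the absence of a request body, and the absence of authentication headers are assumptions, since the page does not document them.

```python
import urllib.request

SLUG = ("intelligence-with-everyone-rl-minimax-with-olive-song-"
        "from-aie-nyc-inference-by-turing-post")

# URLs copied from the Actions listed on this page.
POST_URL = ("https://stenobird.com/v1/public/podcasts/the-cognitive-revolution/"
            f"episodes/{SLUG}/transcription-requests")
GET_URL = f"https://stenobird.com/podcast/the-cognitive-revolution/{SLUG}.md"

# Assumption: the POST takes no body and no auth header (not documented here).
transcription_request = urllib.request.Request(POST_URL, method="POST")
markdown_request = urllib.request.Request(GET_URL, method="GET")


def send(request, timeout=10.0):
    """Hypothetical helper: send a prepared request and return the body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the prepared requests without touching the network.
    for req in (transcription_request, markdown_request):
        print(req.get_method(), req.full_url)
```

Because the POST is described as idempotent, repeating it for the same episode should be safe, though the response format is not specified on this page.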
Summary
MiniMax researcher Olive Song reveals how tight feedback loops between developers and researchers drive the training of the M-series frontier models. The discussion covers technical breakthroughs in reinforcement learning, including the necessity of FP32 precision to prevent implementation gaps.
Topics
- Reinforcement Learning
- Large Language Models
- MiniMax
- AI Agents
- Model Alignment
- FP32 Precision
- Agentic Workflows
- Machine Learning Engineering
Highlights
- Main idea: MiniMax leverages a unique structure where researchers and application developers work side-by-side to create tight product feedback loops
- Technical breakthrough: The team discovered that running reinforcement learning at FP32 precision was essential to bridge the gap between theoretical algorithms and real-world implementation
- Failure mode: Reward hacking remains a constant battle, requiring systematic environment perturbations and robust alignment strategies to prevent models from finding shortcuts
- Practical takeaway: Implementing 'interleaved thinking'—allowing models to pause and process environmental feedback—is key to mastering long-horizon agentic tasks
- Research approach: MiniMax uses a first-principles approach to debugging, analyzing log probabilities layer-by-layer to diagnose why accuracy fails to scale
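The FP32 highlight above can be made concrete with a small numerical sketch. It is not MiniMax's code or setup; it is a toy example, assuming randomly generated logits over a hypothetical 50,000-token vocabulary, that shows how the same log-probabilities drift further from a float64 reference when computed in half precision than in FP32. A mismatch of this kind between two implementations of the same policy is one way a gap between algorithm and implementation can arise.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax, computed in the dtype of `logits`."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


rng = np.random.default_rng(0)
# Hypothetical logits; the vocabulary size and scale are illustrative only.
logits64 = rng.normal(scale=5.0, size=50_000)

reference = log_softmax(logits64)                     # float64 reference
lp32 = log_softmax(logits64.astype(np.float32))       # FP32 computation
lp16 = log_softmax(logits64.astype(np.float16))       # FP16 computation

# Largest absolute deviation of any token's log-probability from the reference.
err32 = float(np.abs(lp32.astype(np.float64) - reference).max())
err16 = float(np.abs(lp16.astype(np.float64) - reference).max())

print(f"max abs log-prob error  fp32: {err32:.2e}  fp16: {err16:.2e}")
```

The same comparison could in principle be repeated on intermediate activations, layer by layer, to locate where two implementations start to diverge, which is the spirit of the debugging approach described in the last highlight.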
Chapters
- 1:00 Introduction to MiniMax and the M-series: An introduction to Olive Song and the development of the M-series models that lead the OpenRouter leaderboards.
- 5:20 The Developer-Researcher Feedback Loop: How having in-house developers provides precise rewards and evaluations for training foundation models.
- 13:20 Agent Generalization and Tool Scaling: Exploring the limits of tool scaling and the move toward more robust agentic capabilities.
- 17:15 The Engineering of Reinforcement Learning: A deep dive into the importance of engineering precision and the fight against reward hacking.
- 22:05 Debugging via Layer-by-Layer Analysis: The story of discovering implementation gaps by analyzing log probabilities at the layer level.
- 30:40 Alignment and Safety at Scale: How MiniMax handles large-scale alignment and safety evaluations before model launches.
- 35:30 Long-Horizon Agentic Tasks: Discussing the implementation of interleaved thinking for complex, multi-step tasks.
- 43:55 The Future of M2.2 and AGI: Looking ahead to improved multilingual coding and the ultimate goal of human-expert collaboration.