Episode

Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Podcast: "The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
Published: Feb 22, 2026
Duration seconds: 3329
Processing state: processed
Canonical source: https://www.cognitiverevolution.ai/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post/
Audio: https://pdst.fm/e/mgln.ai/e/1113/pscrb.fm/rss/p/traffic.megaphone.fm/RINTP9245442386.mp3?updated=1771777343
JSON: /v1/public/podcasts/the-cognitive-revolution/episodes/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post
Markdown: /podcast/the-cognitive-revolution/intelligence-with-everyone-rl-minimax-with-olive-song-from-aie-nyc-inference-by-turing-post.md

Summary

MiniMax researcher Olive Song explains how tight feedback loops between developers and researchers drive the training of the M-series frontier models. The discussion covers reinforcement learning in practice, including the team's finding that running RL at FP32 precision was needed to close the gap between the algorithm as designed and its implementation.

Topics

Reinforcement Learning
Large Language Models
MiniMax
AI Agents
Model Alignment
FP32 Precision
Agentic Workflows
Machine Learning Engineering

Highlights

Main idea: MiniMax has researchers and application developers work side by side, creating tight product feedback loops
Technical breakthrough: The team discovered that running reinforcement learning at FP32 precision was essential to bridge the gap between theoretical algorithms and real-world implementation
Failure mode: Reward hacking remains a constant battle, requiring systematic environment perturbations and robust alignment strategies to prevent models from finding shortcuts
Practical takeaway: Implementing 'interleaved thinking'—allowing models to pause and process environmental feedback—is key to mastering long-horizon agentic tasks
Research approach: MiniMax debugs from first principles, analyzing log probabilities layer by layer to diagnose why accuracy fails to scale

Chapters

1:00 Introduction to MiniMax and the M-series: Introduces Olive Song and the development of the M-series models that lead the OpenRouter leaderboards.
5:20 The Developer-Researcher Feedback Loop: How having in-house developers provides precise rewards and evaluations for training foundation models.
13:20 Agent Generalization and Tool Scaling: Exploring the limits of tool scaling and the move toward more robust agentic capabilities.
17:15 The Engineering of Reinforcement Learning: A deep dive into the importance of engineering precision and the fight against reward hacking.
22:05 Debugging via Layer-by-Layer Analysis: The story of discovering implementation gaps by analyzing log probabilities at the layer level.
30:40 Alignment and Safety at Scale: How MiniMax handles large-scale alignment and safety evaluations before model launches.
35:30 Long-Horizon Agentic Tasks: Discussing the implementation of interleaved thinking for complex, multi-step tasks.
43:55 The Future of M2.2 and AGI: Looking ahead to improved multilingual coding and the ultimate goal of human-expert collaboration.