Episode

SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

Podcast: Latent Space: The AI Engineer Podcast
Published: Dec 18, 2025
Duration seconds: 4503
Processing state: processed
Canonical source: https://www.latent.space/p/sam-3-the-eyes-for-ai-nikhila-and
Audio: https://api.substack.com/feed/podcast/186610536/5ae34479547a797018eb92e7ffb4f660.mp3
JSON: /v1/public/podcasts/latent-space-ai-engineer/episodes/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow
Markdown: /podcast/latent-space-ai-engineer/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow.md

Actions

POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/latent-space-ai-engineer/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow.md
Read the agent-friendly Markdown representation of this episode resource.
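
The two actions above can be sketched as plain HTTP requests. This is a minimal sketch using only Python's standard library; the URLs are copied from the listing, while the helper names are hypothetical. It assumes, without confirmation from this page, that the POST needs no request body or authentication. The requests are only constructed here, not sent.

```python
from urllib.request import Request

# Values copied from the Actions listing above.
BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = (
    "sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence"
    "-ft-joseph-nelson-roboflow"
)


def transcription_request() -> Request:
    """Build the POST that asks for low-priority transcript generation.

    Assumption: no body or auth header is required. The listing only says
    the call is idempotent, so repeating it should be harmless.
    """
    url = f"{BASE}/v1/public/podcasts/{PODCAST}/episodes/{EPISODE}/transcription-requests"
    return Request(url, method="POST")


def markdown_request() -> Request:
    """Build the GET for the agent-friendly Markdown view of the episode."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


if __name__ == "__main__":
    # Print the requests instead of sending them; pass either object to
    # urllib.request.urlopen() to perform the call for real.
    for req in (transcription_request(), markdown_request()):
        print(req.get_method(), req.full_url)
```

Because the POST is described as idempotent, a client could retry it after a timeout without first checking whether an earlier attempt went through.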

Summary

As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)

From SAM 1’s 11-million-image data engine to SAM 2’s memory-based video tracking, MSL’s Segment Anything project has redefined what’s possible in computer vision. Now SAM 3 takes the next leap, concept segmentation: prompting with natural language like “yellow school bus” or “tablecloth” to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio, SAM can now even segment audio output!

We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30 ms on images and scales to real-time video on multi-GPU setups.

We dig into the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned on Llama; the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k; how SAM 3 separates recognition from localization with a presence token; why decoupling the detector and tracker was critical to preserve object identity in video; and how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini. We also cover the real-world impact: 106 million smart polygons created on Roboflow, saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.

We discuss:
* What SAM 3 is: a unified model for concept-prompted segmentation, detectio…