# SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

- Page: https://stenobird.com/podcast/latent-space-ai-engineer/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow
- Text version: https://stenobird.com/podcast/latent-space-ai-engineer/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow.md
- Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
- Published: 2025-12-18T16:00:00+00:00
- Episode link: https://www.latent.space/p/sam-3-the-eyes-for-ai-nikhila-and
- Audio file: https://api.substack.com/feed/podcast/186610536/5ae34479547a797018eb92e7ffb4f660.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow
- Duration seconds: 4503

## Resource

As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!).

From SAM 1’s 11-million-image data engine to SAM 2’s memory-based video tracking, MSL’s Segment Anything project has redefined what’s possible in computer vision. Now SAM 3 takes the next leap: concept segmentation, which means prompting with natural language like “yellow school bus” or “tablecloth” to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest release, SAM Audio, SAM can now even segment audio output!

We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30 ms on images and scales to real-time video on multi-GPU setups.
We dig into the data engine that cut exhaustive annotation from two minutes per image to 25 seconds using AI verifiers fine-tuned on Llama; the new SACO (Segment Anything with Concepts) benchmark, with 200,000+ unique concepts vs. the previous 1.2k; how SAM 3 separates recognition from localization with a presence token; why decoupling the detector and tracker was critical to preserving object identity in video; and how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini. We also cover the real-world impact: 106 million smart polygons created on Roboflow, saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.

We discuss:

* What SAM 3 is: a unified model for concept-prompted segmentation, detectio…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow/transcription-requests`. Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/sam-3-the-eyes-for-ai-nikhila-pengchuan-meta-superintelligence-ft-joseph-nelson-roboflow.md`. Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
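
Since no transcript is published here, an agent that needs one has to call `request_transcript` itself, as described under Actions. The sketch below shows one way to build the two requests with Python's standard library. The URLs and HTTP methods come from the Actions list; everything else is an assumption: the helper names (`build_request_transcript`, `build_read_markdown`, `send`) are hypothetical, the POST is assumed to take an empty body and no authentication, and the response format is not documented on this page, so the code returns it as raw text.

```python
from urllib.request import Request, urlopen

BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = (
    "sam-3-the-eyes-for-ai-nikhila-pengchuan-"
    "meta-superintelligence-ft-joseph-nelson-roboflow"
)


def build_request_transcript() -> Request:
    """Build the POST for the request_transcript action.

    Assumption: an empty body and no auth headers are sufficient.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, data=b"", method="POST")


def build_read_markdown() -> Request:
    """Build the GET for the read_markdown action."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


def send(req: Request, timeout: float = 30.0) -> tuple[int, str]:
    """Send a request and return (status code, body text).

    The response shape is not specified on this page, so the body
    is returned undecoded beyond UTF-8 text.
    """
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Print what would be sent; call send(...) to actually hit the network.
    for req in (build_request_transcript(), build_read_markdown()):
        print(req.get_method(), req.full_url)
```

Because the page describes `request_transcript` as idempotent, repeating the POST should be safe, but polling intervals and rate limits are not stated here and would need to be confirmed with the service.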