# Better Data is All You Need — Ari Morcos, Datology

- Page: https://stenobird.com/podcast/latent-space-ai-engineer/better-data-is-all-you-need-ari-morcos-datology
- Text version: https://stenobird.com/podcast/latent-space-ai-engineer/better-data-is-all-you-need-ari-morcos-datology.md
- Podcast: [Latent Space: The AI Engineer Podcast](https://stenobird.com/podcast/latent-space-ai-engineer)
- Published: 2025-08-29T15:00:00+00:00
- Episode link: https://www.latent.space/p/better-data-is-all-you-need-ari-morcos
- Audio file: https://api.substack.com/feed/podcast/186621779/02cbea33ecbd1e1764dc8be8ad8bce9a.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/better-data-is-all-you-need-ari-morcos-datology
- Duration seconds: 4723

## Resource

Our chat with Ari shows that data curation is the most impactful and underinvested area in AI. He argues that the prevailing focus on model architecture and compute scaling overlooks the “bitter lesson” that “models are what they eat.” Effective data curation, a sophisticated process involving filtering, rebalancing, sequencing (curriculum), and synthetic data generation, allows for training models that are simultaneously faster, better, and smaller. Morcos recounts his personal journey from focusing on model-centric inductive biases to realizing that data quality is the primary lever for breaking the diminishing returns of naive scaling laws. Datology’s mission is to automate this complex curation process, making state-of-the-art data accessible to any organization and enabling a new paradigm of AI development where data efficiency, not just raw scale, drives progress.

Timestamps

- 00:00 Introduction
- 00:46 What is Datology? The mission to train models faster, better, and smaller through data curation.
- 01:59 Ari’s background: From neuroscience to realizing the “Bitter Lesson” of AI.
- 05:30 Key Insight: Inductive biases from architecture become less important and even harmful as data scale increases.
- 08:08 Thesis: Data is the most underinvested area of AI research relative to its impact.
- 10:15 Why data work is culturally undervalued in research and industry.
- 12:19 How self-supervised learning changed everything, moving from a data-scarce to a data-abundant regime.
- 17:05 Why automated curation is superior to human-in-the-loop, citing the DCLM study.
- 19:22 The “Elephants vs. Dogs” analogy for managing data redundancy and complexity.
- 22:46 A brief history and commentary on key datasets (Common Crawl, GitHub, Books3).
- 26:24 Breaking naive scaling laws by imp…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/better-data-is-all-you-need-ari-morcos-datology/transcription-requests`. Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/latent-space-ai-engineer/better-data-is-all-you-need-ari-morcos-datology.md`. Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
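
As a rough illustration of the two actions listed under Actions, the sketch below builds the `POST` and `GET` requests with Python's standard library. The URLs and HTTP methods come from this page; the helper names, the empty request body, and the timeout are assumptions for illustration, and the page does not document the response format or any required headers.

```python
import urllib.request

# URL components taken from the Actions section of this page.
BASE = "https://stenobird.com"
PODCAST = "latent-space-ai-engineer"
EPISODE = "better-data-is-all-you-need-ari-morcos-datology"


def build_transcript_request() -> urllib.request.Request:
    """Build the request_transcript call (POST). Hypothetical helper name."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: the endpoint accepts an empty body; the page does not say.
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Build the read_markdown call (GET). Hypothetical helper name."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Because the page describes `request_transcript` as idempotent, calling `send(build_transcript_request())` more than once should be safe. Nothing is sent over the network until `send` is called.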