Episode

Bridging the AI–Data Gap: Collect, Curate, Serve

Podcast
Data Engineering Podcast
Published
Nov 2, 2025
Duration seconds
3040
Processing state
processed
Canonical source
https://www.dataengineeringpodcast.com/bridging-the-data-ai-gap-episode-487
Audio
https://op3.dev/e/dts.podtrac.com/redirect.mp3/serve.podhome.fm/episode/f6ff0caa-931b-4c08-bfdd-08dc7f5cd336/6389770810242681066b292405-3006-49d2-930a-cafa13f672ed.mp3
JSON
/v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve
Markdown
/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md
    Read the agent-friendly Markdown representation of this episode resource.
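
A minimal sketch of how a client might prepare the two actions above, using only Python's standard library. The URLs and HTTP methods come from the list above. The rest is assumption: the empty POST body, the absence of authentication headers, and the helper names (`transcription_request`, `markdown_request`, `send`) are hypothetical, and the response format is not documented here.

```python
from urllib.request import Request, urlopen

BASE = "https://stenobird.com"
PODCAST = "data-engineering-podcast"
EPISODE = "bridging-the-ai-data-gap-collect-curate-serve"


def transcription_request() -> Request:
    """Build the POST that requests low-priority transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: the endpoint accepts an empty body and needs no auth header.
    return Request(url, data=b"", method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation of this episode."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


def send(req: Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only print what would be sent; call send(req) to hit the network.
    for req in (transcription_request(), markdown_request()):
        print(req.get_method(), req.full_url)
```

Because the POST is described as idempotent, repeating it after a timeout should be safe. Any retry or backoff policy is left to the caller.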

Summary

The bottleneck in AI adoption isn't data collection, but the 'middle layer' of curation, semantics, and reliable serving. Upriver founders Omri Lifshitz and Ido Bronstein explain how to move beyond fragile POCs by building automated, deterministic workflows that bridge the gap between raw data and LLM context.

Topics

  • Data Engineering
  • Large Language Models
  • AI Agents
  • Data Curation
  • Data Pipelines
  • Semantic Modeling
  • Unstructured Data
  • Data Orchestration

Highlights

  • Main idea: The primary challenge in AI scaling is the 'middle layer'—the curation, semantics, and serving of data to agents
  • Failure mode: Relying on simple ETL tools for complex AI workloads creates inflexible infrastructure that cannot handle unstructured data or context windows
  • Practical takeaway: To move from POC to production, engineers must focus on creating reliable, deterministic pipelines that provide high-quality business context to LLMs
  • Main idea: AI agents require the same data quality as humans: high reliability, zero mistakes, and a strong connection to business semantics
  • Future trend: Data engineering is shifting from managing granular pipelines to an architectural role, supervising business semantics while automation handles technical stitching

Chapters

  1. 1:10 The Complexity of Composable Infrastructure: The difficulty of managing fragmented data tools and the need for integrated governance.
  2. 5:10 The Two-Sided Data Demand: How AI simultaneously increases the supply of available data and the organizational demand for usable, high-quality datasets.
  3. 8:50 Beyond Structured Data: The shift from managing purely structured data to handling the complexities of unstructured data for AI agents.
  4. 12:40 Scaling from POC to Production: Addressing the reliability and productionization challenges inherent in deploying AI-driven data feeds.
  5. 16:20 The Semantic Requirement for Agents: Why AI agents need accurate business context and error-free data to be effective.
  6. 20:20 Leveraging Structured Data Models: How well-defined data models allow LLMs to capture and work effectively with organizational data.
  7. 24:10 Integrating Third-Party Data: The challenges and opportunities of connecting external web-scraped data with internal enterprise sources.