Episode

Bridging the AI–Data Gap: Collect, Curate, Serve

Podcast: Data Engineering Podcast
Published: Nov 2, 2025
Duration seconds: 3040
Processing state: processed
Canonical source: https://www.dataengineeringpodcast.com/bridging-the-data-ai-gap-episode-487
Audio: https://op3.dev/e/dts.podtrac.com/redirect.mp3/serve.podhome.fm/episode/f6ff0caa-931b-4c08-bfdd-08dc7f5cd336/6389770810242681066b292405-3006-49d2-930a-cafa13f672ed.mp3
JSON: /v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve
Markdown: /podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md

Actions

POST https://stenobird.com/v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md
Read the agent-friendly Markdown representation of this episode resource.
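
As a minimal sketch, the two actions above could be built with Python's standard library as follows. The helper names and the empty POST body are assumptions for illustration; this page does not document a request payload, required headers, or the response format.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
PODCAST = "data-engineering-podcast"
EPISODE = "bridging-the-ai-data-gap-collect-curate-serve"


def transcription_request(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build (but do not send) the POST that requests transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/{podcast}"
        f"/episodes/{episode}/transcription-requests"
    )
    # Assumption: an empty body is used here because no payload is documented.
    return Request(url, data=b"", method="POST")


def markdown_request(podcast: str = PODCAST, episode: str = EPISODE) -> Request:
    """Build (but do not send) the GET for the Markdown representation."""
    url = f"{BASE}/podcast/{podcast}/{episode}.md"
    return Request(url, method="GET")
```

Either object can be passed to `urllib.request.urlopen` to perform the call. Because the POST is described as idempotent, repeating it for the same episode should not queue duplicate work.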

Summary

The bottleneck in AI adoption is not data collection but the 'middle layer' of curation, semantics, and reliable serving. Upriver founders Omri Lifshitz and Ido Bronstein explain how to move beyond fragile POCs by building automated, deterministic workflows that bridge the gap between raw data and LLM context.

Topics

Data Engineering
Large Language Models
AI Agents
Data Curation
Data Pipelines
Semantic Modeling
Unstructured Data
Data Orchestration

Highlights

Main idea: The primary challenge in scaling AI is the 'middle layer': the curation, semantics, and serving of data to agents
Failure mode: Relying on simple ETL tools for complex AI workloads creates inflexible infrastructure that cannot handle unstructured data or context windows
Practical takeaway: To move from POC to production, engineers must focus on creating reliable, deterministic pipelines that provide high-quality business context to LLMs
Main idea: AI agents need data of the same quality that humans do, meaning highly reliable, free of mistakes, and strongly connected to business semantics
Future trend: Data engineering is shifting from managing granular pipelines to an architectural role, supervising business semantics while automation handles technical stitching

Chapters

1:10 The Complexity of Composable Infrastructure: The difficulty of managing fragmented data tools and the need for integrated governance.
5:10 The Two-Sided Data Demand: How AI simultaneously increases the supply of available data and the organizational demand for usable, high-quality datasets.
8:50 Beyond Structured Data: The shift from managing purely structured data to handling the complexities of unstructured data for AI agents.
12:40 Scaling from POC to Production: Addressing the reliability and productionization challenges inherent in deploying AI-driven data feeds.
16:20 The Semantic Requirement for Agents: Why AI agents need accurate business context and error-free data to be effective.
20:20 Leveraging Structured Data Models: How well-defined data models allow LLMs to capture and work effectively with organizational data.
24:10 Integrating Third-Party Data: The challenges and opportunities of connecting external web-scraped data with internal enterprise sources.