Episode
Bridging the AI–Data Gap: Collect, Curate, Serve
- Podcast: Data Engineering Podcast
- Published: Nov 2, 2025
- Duration seconds: 3040
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md
Read the agent-friendly Markdown representation of this episode resource.
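As a rough illustration of the two actions above, the following sketch builds (but does not send) the corresponding HTTP requests with Python's standard library. Only the URLs and methods come from this page; the helper names are hypothetical, and the empty POST body and absence of headers or authentication are assumptions, since the page does not document them.

```python
from urllib.request import Request

# Values taken from the URLs listed under Actions.
BASE = "https://stenobird.com"
PODCAST = "data-engineering-podcast"
EPISODE = "bridging-the-ai-data-gap-collect-curate-serve"


def transcription_request() -> Request:
    """Build the POST that requests transcript generation (hypothetical helper).

    Assumption: no request body or extra headers are needed.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation (hypothetical helper)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")
```

Either request could then be passed to `urllib.request.urlopen` to actually send it. Because the page describes the POST as idempotent, repeating it should be safe, though the response format is not documented here.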
Summary
The bottleneck in AI adoption isn't data collection, but the 'middle layer' of curation, semantics, and reliable serving. Upriver founders Omri Lifshitz and Ido Bronstein explain how to move beyond fragile POCs by building automated, deterministic workflows that bridge the gap between raw data and LLM context.
Topics
- Data Engineering
- Large Language Models
- AI Agents
- Data Curation
- Data Pipelines
- Semantic Modeling
- Unstructured Data
- Data Orchestration
Highlights
- Main idea: The primary challenge in AI scaling is the 'middle layer'—the curation, semantics, and serving of data to agents
- Failure mode: Relying on simple ETL tools for complex AI workloads creates inflexible infrastructure that cannot handle unstructured data or context windows
- Practical takeaway: To move from POC to production, engineers must focus on creating reliable, deterministic pipelines that provide high-quality business context to LLMs
- Main idea: AI agents require the same data quality as humans: high reliability, zero mistakes, and strong connection to business semantics
- Future trend: Data engineering is shifting from managing granular pipelines to an architectural role, supervising business semantics while automation handles technical stitching
Chapters
- 1:10 The Complexity of Composable Infrastructure: The difficulty of managing fragmented data tools and the need for integrated governance.
- 5:10 The Two-Sided Data Demand: How AI simultaneously increases the supply of available data and the organizational demand for usable, high-quality datasets.
- 8:50 Beyond Structural Data: The shift from managing purely structural data to handling the complexities of unstructured data for AI agents.
- 12:40 Scaling from POC to Production: Addressing the reliability and productionization challenges inherent in deploying AI-driven data feeds.
- 16:20 The Semantic Requirement for Agents: Why AI agents need accurate business context and error-free data to be effective.
- 20:20 Leveraging Structured Data Models: How well-defined data models allow LLMs to capture and work effectively with organizational data.
- 24:10 Integrating Third-Party Data: The challenges and opportunities of connecting external web-scraped data with internal enterprise sources.