# Bridging the AI–Data Gap: Collect, Curate, Serve

Page: https://stenobird.com/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve
Text version: https://stenobird.com/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md
Podcast: [Data Engineering Podcast](https://stenobird.com/podcast/data-engineering-podcast)
Published: 2025-11-02T19:31:17+00:00
Episode link: https://www.dataengineeringpodcast.com/bridging-the-data-ai-gap-episode-487
Audio file: https://op3.dev/e/dts.podtrac.com/redirect.mp3/serve.podhome.fm/episode/f6ff0caa-931b-4c08-bfdd-08dc7f5cd336/6389770810242681066b292405-3006-49d2-930a-cafa13f672ed.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve
Duration seconds: 3040

## Resource

The bottleneck in AI adoption is not data collection but the 'middle layer' of curation, semantics, and reliable serving. Upriver founders Omri Lifshitz and Ido Bronstein explain how to move beyond fragile proofs of concept (POCs) by building automated, deterministic workflows that bridge the gap between raw data and LLM context.

## Highlights

- Main idea: The primary challenge in AI scaling is the 'middle layer'—the curation, semantics, and serving of data to agents
- Failure mode: Relying on simple ETL tools for complex AI workloads creates inflexible infrastructure that cannot handle unstructured data or context windows
- Practical takeaway: To move from POC to production, engineers must focus on creating reliable, deterministic pipelines that provide high-quality business context to LLMs
- Main idea: AI agents require the same data quality as humans: high reliability, zero mistakes, and strong connection to business semantics
- Future trend: Data engineering is shifting from managing granular pipelines to an architectural role, supervising business semantics while automation handles technical stitching

## Topics

Data Engineering, Large Language Models, AI Agents, Data Curation, Data Pipelines, Semantic Modeling, Unstructured Data, Data Orchestration

## Chapters

- 1:10 — The Complexity of Composable Infrastructure: The difficulty of managing fragmented data tools and the need for integrated governance.
- 5:10 — The Two-Sided Data Demand: How AI simultaneously increases the supply of available data and the organizational demand for usable, high-quality datasets.
- 8:50 — Beyond Structured Data: The shift from managing purely structured data to handling the complexities of unstructured data for AI agents.
- 12:40 — Scaling from POC to Production: Addressing the reliability and productionization challenges inherent in deploying AI-driven data feeds.
- 16:20 — The Semantic Requirement for Agents: Why AI agents need accurate business context and error-free data to be effective.
- 20:20 — Leveraging Structured Data Models: How well-defined data models allow LLMs to capture and work effectively with organizational data.
- 24:10 — Integrating Third-Party Data: The challenges and opportunities of connecting external web-scraped data with internal enterprise sources.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/data-engineering-podcast/episodes/bridging-the-ai-data-gap-collect-curate-serve/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/data-engineering-podcast/bridging-the-ai-data-gap-collect-curate-serve.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
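
As a rough illustration of how an agent might call the two actions listed above, here is a minimal Python sketch using only the standard library. The URLs and HTTP methods come from this page. Everything else is an assumption: the helper names (`build_read_markdown_request`, `build_transcript_request`, `send`) are hypothetical, the POST is sent with an empty body because no payload is documented here, and the response format is not specified, so the body is returned as plain text without parsing.

```python
import urllib.request

# Values taken from the action URLs listed on this page.
BASE = "https://stenobird.com"
PODCAST = "data-engineering-podcast"
EPISODE = "bridging-the-ai-data-gap-collect-curate-serve"


def build_read_markdown_request() -> urllib.request.Request:
    """Build the GET request for the read_markdown action."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def build_transcript_request() -> urllib.request.Request:
    """Build the POST request for the request_transcript action.

    Assumption: an empty body is sent, since this page documents no payload.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return urllib.request.Request(url, data=b"", method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a request and return (status code, body decoded as text).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


if __name__ == "__main__":
    # Reading the page never enqueues transcription; that needs the explicit POST.
    status, body = send(build_read_markdown_request())
    print(status, body[:200])
```

Since the page describes `request_transcript` as idempotent, repeating `send(build_transcript_request())` should not queue duplicate work, though the exact response for a repeated request is not documented here.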

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.