# Your LLM issues are really data issues

- Page: https://stenobird.com/podcast/the-stack-overflow-podcast/your-llm-issues-are-really-data-issues
- Text version: https://stenobird.com/podcast/the-stack-overflow-podcast/your-llm-issues-are-really-data-issues.md
- Podcast: [The Stack Overflow Podcast](https://stenobird.com/podcast/the-stack-overflow-podcast)
- Published: 2026-04-28T04:00:00+00:00
- Episode link: https://rss.art19.com/episodes/0a977ae3-f9b6-4920-9c66-c1e57cd2db8c.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
- Audio file: https://rss.art19.com/episodes/0a977ae3-f9b6-4920-9c66-c1e57cd2db8c.mp3?rss_browser=BAhJIg90cmFuc2NyaWJyBjoGRVQ%3D--952c5701c84ad333c69d5faa668f8177091704f0
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/your-llm-issues-are-really-data-issues
- Duration seconds: 1894

## Resource

LLMs fail in production environments not because of model limitations, but because of fragmented, undocumented, and inconsistent data ecosystems. To make AI truly useful, organizations must shift focus from model training to building a robust semantic metadata layer.

## Highlights

- Main idea: LLM performance is bottlenecked by the lack of structured, documented metadata in production environments
- Failure mode: Relying on manual documentation or human intervention to define data ownership and schemas fails to scale as data volume grows
- Practical takeaway: Implement automated metadata scanning to capture ownership, lineage, and data quality signals without manual effort
- Main idea: A semantic metadata graph allows LLMs to reason about relationships (e.g., what 'customer' means) without processing every raw data row
- Practical takeaway: Use metadata as the interface for AI agents to navigate complex distributed systems like Snowflake, Kafka, or Hadoop

## Topics

LLM implementation, Data Governance, Metadata Management, Semantic Web, Data Lineage, Big Data, AI Observability, Knowledge Graphs

## Chapters

- 1:00 — The Journey from Hadoop to AI: A look at the evolution of big data processing and the transition from scaling indexing systems to managing modern AI use cases.
- 3:10 — The Complexity of Distributed Systems: Discussing how cloud providers have solved distributed processing, but left the challenge of data accessibility and understanding unsolved.
- 5:35 — The Data Ecosystem Bottleneck: Why throwing an LLM at a raw data ecosystem fails when columns lack context and definitions are inconsistent across companies.
- 7:55 — The Problem of Data Silos: How the split between production and analytics databases creates standardization and naming conflicts.
- 10:10 — The Human Element in Data Governance: The organizational challenge of maintaining documentation and the difficulty of scaling knowledge across large teams.
- 12:25 — Solving the Semantic Problem: Addressing the ambiguity of business metrics, such as defining 'customer health,' through semantic intelligence.
- 14:45 — Automating Metadata Discovery: The importance of automated scanning to capture ownership, lineage, and quality signals to ensure data is AI-ready.
- 17:05 — Building a Knowledge Graph for Data: Using metadata schemas to map relationships between services, databases, and end-user dashboards.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/the-stack-overflow-podcast/episodes/your-llm-issues-are-really-data-issues/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/the-stack-overflow-podcast/your-llm-issues-are-really-data-issues.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
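
For agents that need this episode processed, the two endpoints listed under Actions could be called roughly as in the sketch below. It uses only the Python standard library and the URLs given on this page. The empty POST body, the absence of authentication headers, and the helper names (`build_request_transcript`, `build_read_markdown`, `send`) are assumptions made for illustration; the page does not document a request body, required headers, or a response format.

```python
from urllib.request import Request, urlopen

PODCAST = "the-stack-overflow-podcast"
EPISODE = "your-llm-issues-are-really-data-issues"

# URLs copied from the Actions section of this page.
TRANSCRIPT_URL = (
    f"https://stenobird.com/v1/public/podcasts/{PODCAST}"
    f"/episodes/{EPISODE}/transcription-requests"
)
MARKDOWN_URL = f"https://stenobird.com/podcast/{PODCAST}/{EPISODE}.md"


def build_request_transcript() -> Request:
    """Prepare the request_transcript action (hypothetical helper).

    Assumption: the endpoint accepts a POST with an empty body and no
    authentication; neither detail is documented on this page.
    """
    return Request(TRANSCRIPT_URL, data=b"", method="POST")


def build_read_markdown() -> Request:
    """Prepare the read_markdown action (hypothetical helper)."""
    return Request(MARKDOWN_URL, method="GET")


def send(request: Request, timeout: float = 10.0) -> tuple:
    """Send a prepared request and return (status code, decoded body).

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses,
    so callers should be ready to catch it.
    """
    with urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

Calling `send(build_request_transcript())` issues the POST, and `send(build_read_markdown())` fetches the Markdown representation. Because the page describes `request_transcript` as idempotent, repeating the POST for the same episode should be safe, though the status code and payload it returns are not specified here.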