# #241 - Building AI systems with quality, holistic data

Page: https://stenobird.com/podcast/data-futurology-leadership-and-strategy/241-building-ai-systems-with-quality-holistic-data
Text version: https://stenobird.com/podcast/data-futurology-leadership-and-strategy/241-building-ai-systems-with-quality-holistic-data.md
Podcast: [Data Futurology - Leadership And Strategy in Artificial Intelligence, Machine Learning, Data Science](https://stenobird.com/podcast/data-futurology-leadership-and-strategy)
Published: 2023-07-19T01:08:11+00:00
Episode link: https://podcasters.spotify.com/pod/show/datafuturology/episodes/241---Building-AI-systems-with-quality--holistic-data-e273trk
Audio file: https://anchor.fm/s/3fab060/podcast/play/73577780/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2023-6-19%2F339825247-44100-2-4e739dd411522.m4a
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/data-futurology-leadership-and-strategy/episodes/241-building-ai-systems-with-quality-holistic-data
Duration seconds: 1796

## Resource

Unstructured data often contains critical, hidden information like PII or security threats that traditional systems miss. This presentation explores how advanced analytics and massive file-type support enable holistic data discovery and automated intelligence.
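
To make the idea of finding PII in unstructured text concrete, here is a minimal rule-based sketch of detection, redaction, and alerting. It is not the system discussed in the episode: the `PATTERNS` table, the `Alert` type, and the `redact` function are hypothetical, and the license and address formats are invented for illustration only.

```python
import re
from typing import NamedTuple

# Illustrative patterns only (assumption): real license and address formats
# vary by jurisdiction, and a production system would likely need far more
# than hand-written regexes.
PATTERNS = {
    "DRIVER_LICENSE": re.compile(r"\b[A-Z]\d{7}\b"),
    "STREET_ADDRESS": re.compile(
        r"\b\d{1,5}\s+[A-Z][a-z]+\s+(?:Street|St|Road|Rd|Avenue|Ave)\b"
    ),
}


class Alert(NamedTuple):
    """One detected span of sensitive text (hypothetical record type)."""
    kind: str
    start: int
    end: int


def redact(text: str) -> tuple[str, list[Alert]]:
    """Replace each match with a [KIND] placeholder and report what was found."""
    alerts = [
        Alert(kind, m.start(), m.end())
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Apply replacements right to left so earlier offsets stay valid.
    # These two patterns cannot overlap, so no conflict handling is needed.
    redacted = text
    for alert in sorted(alerts, key=lambda a: a.start, reverse=True):
        redacted = redacted[:alert.start] + f"[{alert.kind}]" + redacted[alert.end:]
    alerts.sort(key=lambda a: a.start)
    return redacted, alerts
```

Calling `redact(document_text)` returns the redacted text together with a list of alerts whose offsets refer to the original text, so a caller could log or forward them separately from the cleaned document.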
## Highlights

- Main idea: Achieving holistic data discovery requires the ability to ingest and analyze a vast array of unstructured formats, including audio, video, and images
- Practical takeaway: Use automated pattern recognition in video feeds to detect anomalies, such as unusual traffic patterns indicating potential security threats
- Failure mode: Relying on standard web crawling for the dark web is ineffective; specialized dynamic corpus mapping is required to navigate fragmented, non-linear data
- Main idea: Advanced NLP and speech-to-text models must account for regional accents and dialects to maintain high accuracy in global deployments
- Practical takeaway: Implement automated redaction and alerting for sensitive data like driver's licenses or addresses found within unstructured text files

## Topics

Unstructured Data, Machine Learning, Natural Language Processing, Computer Vision, Data Governance, Information Security, Pattern Recognition, Automated Intelligence

## Chapters

- 3:10 — Introduction to Unstructured Analytics: Vinay Joseph introduces the challenges of managing unstructured data across various industry verticals.
- 5:20 — Detecting PII in Unstructured Text: How to identify sensitive information like addresses and licenses hidden within file shares and web servers.
- 7:40 — ML Functions and Data Ingestion: An overview of the ingestion suite, connectors, and the REST API layer for developer integration.
- 9:50 — Automated Redaction and Indexing: Using SharePoint indexing to automatically detect, redact, and alert on sensitive document content.
- 12:10 — Extensible Ingestion Pipelines: Integrating proprietary ECM systems and custom ingestion pipelines into existing workflows.
- 14:20 — Navigating the Dark Web: Using dynamic corpus mapping to track stolen credentials and illicit marketplaces in fragmented environments.
- 16:40 — Computer Vision and Drone Analytics: Applying image analytics to drone feeds to identify patterns of interest and potential threats.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/data-futurology-leadership-and-strategy/episodes/241-building-ai-systems-with-quality-holistic-data/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/data-futurology-leadership-and-strategy/241-building-ai-systems-with-quality-holistic-data.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
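
Because a transcript has to be requested explicitly, an agent might prepare the two actions listed on this page as plain HTTP requests. The sketch below uses only the Python standard library and the URLs given above; the helper names (`request_transcript`, `read_markdown`, `send`) are hypothetical, and it assumes the endpoints need no request body or authentication, which this page does not confirm. The response format is not documented here, so `send` returns the raw status and body without interpreting them.

```python
from urllib.request import Request, urlopen

# Identifiers copied from the URLs on this page.
SHOW = "data-futurology-leadership-and-strategy"
EPISODE = "241-building-ai-systems-with-quality-holistic-data"


def request_transcript() -> Request:
    """Build the POST for the request_transcript action (no body assumed)."""
    url = (
        f"https://stenobird.com/v1/public/podcasts/{SHOW}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, method="POST")


def read_markdown() -> Request:
    """Build the GET for the read_markdown action."""
    url = f"https://stenobird.com/podcast/{SHOW}/{EPISODE}.md"
    return Request(url, method="GET")


def send(req: Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (HTTP status, response text).

    Raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from `send` lets a caller inspect or log the method and URL before any network traffic occurs. Since the action is described as idempotent, repeating `send(request_transcript())` should not queue duplicate work.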