Episode
#241 - Building AI systems with quality, holistic data
- Podcast
- Data Futurology - Leadership And Strategy in Artificial Intelligence, Machine Learning, Data Science
- Published
- Jul 19, 2023
- Duration seconds
- 1796
- Processing state
processed
Actions
POST https://stenobird.com/v1/public/podcasts/data-futurology-leadership-and-strategy/episodes/241-building-ai-systems-with-quality-holistic-data/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/data-futurology-leadership-and-strategy/241-building-ai-systems-with-quality-holistic-data.md
Read the agent-friendly Markdown representation of this episode resource.
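As a rough illustration of the POST action listed above, the following Python sketch only builds the request object and does not send it. The empty request body and the absence of authentication headers are assumptions, since this page does not document either; the helper name `build_transcription_request` is hypothetical.

```python
import urllib.request

# URL components copied from the POST action listed on this page.
BASE = "https://stenobird.com/v1/public/podcasts"
PODCAST = "data-futurology-leadership-and-strategy"
EPISODE = "241-building-ai-systems-with-quality-holistic-data"


def build_transcription_request(podcast=PODCAST, episode=EPISODE):
    """Build (but do not send) the transcription-request POST.

    Assumption: the endpoint accepts an empty body and needs no auth
    header; neither detail is stated on this page.
    """
    url = f"{BASE}/{podcast}/episodes/{episode}/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


req = build_transcription_request()
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call. Because the action is described as idempotent, repeating it should be safe, but the response format is not documented here.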
Summary
Unstructured data often contains critical, hidden information like PII or security threats that traditional systems miss. This presentation explores how advanced analytics and massive file-type support enable holistic data discovery and automated intelligence.
Topics
- Unstructured Data
- Machine Learning
- Natural Language Processing
- Computer Vision
- Data Governance
- Information Security
- Pattern Recognition
- Automated Intelligence
Highlights
- Main idea: Achieving holistic data discovery requires the ability to ingest and analyze a vast array of unstructured formats, including audio, video, and images
- Practical takeaway: Use automated pattern recognition in video feeds to detect anomalies, such as unusual traffic patterns indicating potential security threats
- Failure mode: Relying on standard web crawling for the dark web is ineffective; specialized dynamic corpus mapping is required to navigate fragmented, non-linear data
- Main idea: Advanced NLP and speech-to-text models must account for regional accents and dialects to maintain high accuracy in global deployments
- Practical takeaway: Implement automated redaction and alerting for sensitive data like driver's licenses or addresses found within unstructured text files
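The redaction-and-alerting takeaway above can be sketched in a few lines of Python. This is a minimal illustration using regular expressions, not the approach described in the episode, which relies on ML-based analytics. Both patterns are hypothetical simplifications: real driver's licence and address formats vary by jurisdiction.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "DRIVER_LICENCE": re.compile(r"\b[A-Z]{1,2}\d{6,8}\b"),
    "STREET_ADDRESS": re.compile(
        r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}(?:Street|St|Road|Rd|Avenue|Ave)\b"
    ),
}


def redact(text):
    """Return the redacted text plus a list of (label, matched_value) alerts."""
    alerts = []
    for label, pattern in PATTERNS.items():
        def _replace(match, label=label):
            # Record what was found so a downstream alerting step can act on it.
            alerts.append((label, match.group(0)))
            return f"[REDACTED {label}]"
        text = pattern.sub(_replace, text)
    return text, alerts


sample = "Licence AB1234567 registered to 42 Example Street, Sydney."
clean, found = redact(sample)
print(clean)
print(found)
```

Returning the alerts alongside the redacted text keeps detection and notification separate, so the same scan could feed an index, a review queue, or an alerting channel.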
Chapters
- 3:10 Introduction to Unstructured Analytics: Vinay Joseph introduces the challenges of managing unstructured data across various industry verticals.
- 5:20 Detecting PII in Unstructured Text: How to identify sensitive information like addresses and licenses hidden within file shares and web servers.
- 7:40 ML Functions and Data Ingestion: An overview of the ingestion suite, connectors, and the REST API layer for developer integration.
- 9:50 Automated Redaction and Indexing: Using SharePoint indexing to automatically detect, redact, and alert on sensitive document content.
- 12:10 Extensible Ingestion Pipelines: Integrating proprietary ECM systems and custom ingestion pipelines into existing workflows.
- 14:20 Navigating the Dark Web: Using dynamic corpus mapping to track stolen credentials and illicit marketplaces in fragmented environments.
- 16:40 Computer Vision and Drone Analytics: Applying image analytics to drone feeds to identify patterns of interest and potential threats.