Episode

#241 - Building AI systems with quality, holistic data

Podcast: Data Futurology - Leadership And Strategy in Artificial Intelligence, Machine Learning, Data Science
Published: Jul 19, 2023
Duration seconds: 1796
Processing state: processed
Canonical source: https://podcasters.spotify.com/pod/show/datafuturology/episodes/241---Building-AI-systems-with-quality--holistic-data-e273trk
Audio: https://anchor.fm/s/3fab060/podcast/play/73577780/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2023-6-19%2F339825247-44100-2-4e739dd411522.m4a
JSON: /v1/public/podcasts/data-futurology-leadership-and-strategy/episodes/241-building-ai-systems-with-quality-holistic-data
Markdown: /podcast/data-futurology-leadership-and-strategy/241-building-ai-systems-with-quality-holistic-data.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/data-futurology-leadership-and-strategy/episodes/241-building-ai-systems-with-quality-holistic-data/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/data-futurology-leadership-and-strategy/241-building-ai-systems-with-quality-holistic-data.md
    Read the agent-friendly Markdown representation of this episode resource.
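
A minimal Python sketch of how a client might build these two requests, using only the standard library. The URLs come from the actions listed above; the empty POST body, the absence of authentication headers, and the helper names (`transcription_request`, `markdown_request`, `send`) are assumptions for illustration, since this page does not document the request or response format.

```python
import urllib.request

BASE_URL = "https://stenobird.com"
PODCAST = "data-futurology-leadership-and-strategy"
EPISODE = "241-building-ai-systems-with-quality-holistic-data"


def transcription_request(podcast: str = PODCAST, episode: str = EPISODE) -> urllib.request.Request:
    """Build the POST that asks for low-priority transcript generation."""
    url = f"{BASE_URL}/v1/public/podcasts/{podcast}/episodes/{episode}/transcription-requests"
    # Assumption: an empty body with no auth headers is accepted.
    return urllib.request.Request(url, data=b"", method="POST")


def markdown_request(podcast: str = PODCAST, episode: str = EPISODE) -> urllib.request.Request:
    """Build the GET for the Markdown representation of the episode."""
    url = f"{BASE_URL}/podcast/{podcast}/{episode}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Because the POST is described as idempotent, calling `send(transcription_request())` more than once for the same episode should be safe; the shape of the response body is not specified here.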

Summary

Unstructured data often contains critical, hidden information, such as PII or signs of security threats, that traditional systems miss. This presentation explores how advanced analytics and support for a wide range of file types enable holistic data discovery and automated intelligence.

Topics

  • Unstructured Data
  • Machine Learning
  • Natural Language Processing
  • Computer Vision
  • Data Governance
  • Information Security
  • Pattern Recognition
  • Automated Intelligence

Highlights

  • Main idea: Achieving holistic data discovery requires the ability to ingest and analyze a vast array of unstructured formats, including audio, video, and images
  • Practical takeaway: Use automated pattern recognition in video feeds to detect anomalies, such as unusual traffic patterns indicating potential security threats
  • Failure mode: Relying on standard web crawling for the dark web is ineffective; specialized dynamic corpus mapping is required to navigate fragmented, non-linear data
  • Main idea: Advanced NLP and speech-to-text models must account for regional accents and dialects to maintain high accuracy in global deployments
  • Practical takeaway: Implement automated redaction and alerting for sensitive data like driver's licenses or addresses found within unstructured text files
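
The redaction-and-alerting takeaway can be illustrated with a small Python sketch. This is not the product discussed in the episode: it swaps in plain regular expressions, and both patterns (a two-letter, six-digit licence number and a simple street-address shape) are hypothetical formats chosen only for the example. Real detection would need jurisdiction-specific rules or a trained model.

```python
import re

# Hypothetical patterns for illustration only; real formats vary by jurisdiction.
PATTERNS = {
    "DRIVER_LICENCE": re.compile(r"\b[A-Z]{2}\d{6}\b"),
    "STREET_ADDRESS": re.compile(
        r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}(?:Street|St|Road|Rd|Avenue|Ave)\b"
    ),
}


def redact(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Replace each match with a placeholder and collect (label, value) alerts."""
    alerts: list[tuple[str, str]] = []
    for label, pattern in PATTERNS.items():

        def replace(match: re.Match, label: str = label) -> str:
            # Record what was found so a reviewer can be alerted.
            alerts.append((label, match.group(0)))
            return f"[REDACTED {label}]"

        text = pattern.sub(replace, text)
    return text, alerts


clean, found = redact("Licence AB123456, lives at 12 Example Street.")
print(clean)
print(found)
```

The function returns the redacted text alongside the list of findings, so the same pass can feed both a sanitised index and an alerting queue.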

Chapters

  1. 3:10 Introduction to Unstructured Analytics: Vinay Joseph introduces the challenges of managing unstructured data across various industry verticals.
  2. 5:20 Detecting PII in Unstructured Text: How to identify sensitive information like addresses and licenses hidden within file shares and web servers.
  3. 7:40 ML Functions and Data Ingestion: An overview of the ingestion suite, connectors, and the REST API layer for developer integration.
  4. 9:50 Automated Redaction and Indexing: Using SharePoint indexing to automatically detect, redact, and alert on sensitive document content.
  5. 12:10 Extensible Ingestion Pipelines: Integrating proprietary ECM systems and custom ingestion pipelines into existing workflows.
  6. 14:20 Navigating the Dark Web: Using dynamic corpus mapping to track stolen credentials and illicit marketplaces in fragmented environments.
  7. 16:40 Computer Vision and Drone Analytics: Applying image analytics to drone feeds to identify patterns of interest and potential threats.