{"podcast":{"title":"Practical AI","slug":"practical-ai","podcast_index_feed_id":444526,"rss_url":"https://feeds.transistor.fm/practical-ai-machine-learning-data-science-llm","website_url":"https://practicalai.fm","image_url":"https://img.transistorcdn.com/WMlp2ug34XB6LDJ3-vnzti_-_y144LUlFW0Xzzn3fss/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS8wMTZi/ZWJmNWIwNDdmYTcw/NGJjMTExZjNjZmYy/M2ZjNS5wbmc.jpg","author":"Practical AI LLC","episode_count":357,"summary":"Making artificial intelligence practical, productive & accessible to everyone. Practical AI is a show in which technology professionals, business people, students, enthusiasts, and expert guests engage in lively discussions about Artificial Intelligence and related topics (Machine Learning, Deep Learning, Neural Networks, GANs, MLOps, AIOps, LLMs & more). The focus is on productive implementations and real-world scenarios that are accessible to everyone. If you want to keep up with the latest advances in AI, while keeping one foot in the real world, then this is the show for you!","last_synced_at":null,"page_url":"https://stenobird.com/podcast/practical-ai"},"episode":{"title":"Technical advances in document understanding","slug":"technical-advances-in-document-understanding","published_at":"2025-12-02T20:58:43+00:00","page_url":"https://stenobird.com/podcast/practical-ai/technical-advances-in-document-understanding","show_page_url":"https://stenobird.com/podcast/practical-ai","url":"https://share.transistor.fm/s/ba36c917","audio_url":"https://pscrb.fm/rss/p/dts.podtrac.com/redirect.mp3/media.transistor.fm/ba36c917/9f513f05.mp3","summary":"Explore the evolution of document understanding from traditional OCR to advanced vision-language models. 
Learn how modern architectures like DeepSeek-OCR and layout-aware parsers solve the structural fragmentation problem in RAG pipelines.","meta_description":"A deep dive into the technical shift from OCR to Vision-Language Models (VLMs) and the impact of document structure models on AI data extraction.","key_points":["Main idea: Document processing is moving from simple character recognition to complex structural understanding using layout primitives","Technical shift: Transitioning from LSTMs and CNNs to Transformers and Vision-Language Models (VLMs) allows for better context retention","Failure mode: Standard OCR often jumbles document context, breaking the logical flow of tables and multi-column layouts in RAG systems","Practical takeaway: Combining layout models like Docling with OCR can reconstruct documents into structured formats like Markdown or JSON","Future trend: New models like DeepSeek-OCR are addressing resolution limitations by using vision tokens to maintain global page context"],"chapters":[{"start_ms":285000,"title":"The Evolution of Document Processing","summary":"An overview of how document understanding has progressed from basic computer vision to modern AI-driven approaches."},{"start_ms":515000,"title":"OCR vs. 
Vision-Language Models","summary":"Comparing the mechanics of traditional Optical Character Recognition with the newer paradigm of Vision-Language Models (VLMs)."},{"start_ms":750000,"title":"Pixels to Probabilities","summary":"A technical look at how vision models process image pixels to output character probabilities similar to LLM token prediction."},{"start_ms":1175000,"title":"The Challenge of Input Resolution","summary":"Discussing the limitations of resizing images and how fixed-resolution inputs can degrade model performance."},{"start_ms":1395000,"title":"Document Structure and Layout Models","summary":"How models identify headings, paragraphs, and tables to create a structured tree representation of a document."},{"start_ms":1835000,"title":"Impact on RAG Systems","summary":"The consequences of losing document order and structure when feeding unstructured OCR text into retrieval-augmented generation."},{"start_ms":2080000,"title":"The Rise of Vision-Language Models","summary":"Exploring models that treat images and text as unified inputs to preserve semantic and spatial relationships."}],"topics":["OCR","Vision-Language Models","Document AI","RAG","DeepSeek-OCR","Computer Vision","Document Structure","Machine Learning"],"duration_seconds":2958,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/practical-ai/episodes/technical-advances-in-document-understanding/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/practical-ai/technical-advances-in-document-understanding.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}