{"podcast":{"title":"The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)","slug":"twiml-ai-podcast","podcast_index_feed_id":1045879,"rss_url":"https://feeds.megaphone.fm/MLN2155636147","website_url":"https://twimlai.com","image_url":"https://megaphone.imgix.net/podcasts/35230150-ee98-11eb-ad1a-b38cbabcd053/image/TWIML_AI_Podcast_Official_Cover_Art_1400px.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress","author":"TWIML","episode_count":785,"summary":"Machine learning and artificial intelligence are dramatically changing the way businesses operate and people live. The TWIML AI Podcast brings the top minds and ideas from the world of ML and AI to a broad and influential community of ML/AI researchers, data scientists, engineers and tech-savvy business and IT leaders. Hosted by Sam Charrington, a sought after industry analyst, speaker, commentator and thought leader. Technologies covered include machine learning, artificial intelligence, deep learning, natural language processing, neural networks, analytics, computer science, data science and more.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/twiml-ai-podcast"},"episode":{"title":"Why Vision Language Models Ignore What They See with Munawar Hayat - #758","slug":"why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758","published_at":"2025-12-09T19:46:00+00:00","page_url":"https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758","show_page_url":"https://stenobird.com/podcast/twiml-ai-podcast","url":"https://twimlai.com/podcast/twimlai/why-vision-language-models-ignore-what-they-see/","audio_url":"https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7251543598.mp3?updated=1765310086","summary":"Vision-Language Models (VLMs) often suffer from hallucinations because they rely on language priors rather than actual visual input. This episode explores new research from Qualcomm AI Research on enforcing visual grounding and improving multimodal retrieval.","meta_description":"Explore why VLMs ignore visual data and how new research in attention-guided alignment and generalized contrastive learning can fix multimodal hallucinati…","key_points":["Main idea: VLMs frequently ignore visual tokens, instead relying on the language model's pre-trained parametric memory to answer questions","Failure mode: Models struggle with 'counting' and 'iterative reasoning' because they process images and text in a single end-to-end step without intermediate reflection","Practical takeaway: New attention-guided alignment techniques can force models to ground their responses in the actual visual features provided","Main idea: Generalized Contrastive Learning (GCL) enables complex retrieval tasks, such as searching for images using a combination of both text and image queries","Technical challenge: Generating multiple human subjects in generative models often leads to identity leakage and attribute blending between individuals"],"chapters":[{"start_ms":60000,"title":"Introduction to Qualcomm AI Research","summary":"Munawar Hayat introduces his background in computer vision and his current focus on multimodal generative AI at Qualcomm."},{"start_ms":625000,"title":"The Root of VLM Hallucinations","summary":"An analysis of why Vision-Language Models discard visual information in favor of linguistic priors during inference."},{"start_ms":1135000,"title":"Efficient Cross-Attention Architectures","summary":"A discussion on the computational complexity of injecting visual tokens via cross-attention modules versus simple concatenation."},{"start_ms":1915000,"title":"Composed Multimodal Retrieval","summary":"Exploring how to handle complex queries that use both text and images as keys for searching large image galleries."},{"start_ms":2435000,"title":"Addressing Identity Leakage in Human Generation","summary":"Introduction to the MultiHuman Testbench designed to measure and mitigate attribute blending in multi-person image generation."},{"start_ms":3165000,"title":"Efficient Inference and Long Context","summary":"A look at KV cache eviction and speculative decoding techniques for optimizing LLM and VLM performance on-device."},{"start_ms":3405000,"title":"Closing Remarks","summary":"Final thoughts on upcoming NeurIPS demos and the future of mobile-efficient AI."}],"topics":["Vision-Language Models","Multimodal AI","Hallucination Mitigation","Contrastive Learning","Generative AI","Qualcomm AI Research","NeurIPS","On-device AI"],"duration_seconds":3460,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}