Episode
Why Vision Language Models Ignore What They See with Munawar Hayat - #758
- Published: Dec 9, 2025
- Duration seconds: 3460
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758.md
Read the agent-friendly Markdown representation of this episode resource.
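The two actions listed here can be sketched with Python's standard library. This is a minimal sketch under stated assumptions: the URLs are copied from this page, but the page gives no request body, headers, authentication, or response format, so the code sends none and treats the response as opaque text. The helper names (`transcription_request`, `markdown_request`, `send`) are hypothetical and exist only for illustration.

```python
import urllib.request

# Values copied from the action URLs on this page.
BASE = "https://stenobird.com"
PODCAST = "twiml-ai-podcast"
EPISODE = "why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758"


def transcription_request(podcast: str = PODCAST, episode: str = EPISODE) -> urllib.request.Request:
    """Build the POST that asks for low-priority transcript generation."""
    url = f"{BASE}/v1/public/podcasts/{podcast}/episodes/{episode}/transcription-requests"
    # Assumption: no body is required, since the page documents none.
    return urllib.request.Request(url, method="POST")


def markdown_request(podcast: str = PODCAST, episode: str = EPISODE) -> urllib.request.Request:
    """Build the GET for the Markdown representation of the episode."""
    url = f"{BASE}/podcast/{podcast}/{episode}.md"
    return urllib.request.Request(url, method="GET")


def send(request: urllib.request.Request, timeout: float = 30.0) -> tuple[int, str]:
    """Send a prepared request and return (HTTP status, decoded body).

    The body is returned as plain text because the response schema is
    not described on this page.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling `send(transcription_request())` would issue the POST, and `send(markdown_request())` would fetch the Markdown. Because the page describes the POST as idempotent, repeating it should be safe, though how the service signals an already-queued request is not stated here.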
Summary
Vision-Language Models (VLMs) often suffer from hallucinations because they rely on language priors rather than actual visual input. This episode explores new research from Qualcomm AI Research on enforcing visual grounding and improving multimodal retrieval.
Topics
- Vision-Language Models
- Multimodal AI
- Hallucination Mitigation
- Contrastive Learning
- Generative AI
- Qualcomm AI Research
- NeurIPS
- On-device AI
Highlights
- Main idea: VLMs frequently ignore visual tokens, instead relying on the language model's pre-trained parametric memory to answer questions
- Failure mode: Models struggle with 'counting' and 'iterative reasoning' because they process images and text in a single end-to-end step without intermediate reflection
- Practical takeaway: New attention-guided alignment techniques can force models to ground their responses in the actual visual features provided
- Main idea: Generalized Contrastive Learning (GCL) enables complex retrieval tasks, such as searching for images using a combination of both text and image queries
- Technical challenge: Generating multiple human subjects in generative models often leads to identity leakage and attribute blending between individuals
Chapters
- 1:00 Introduction to Qualcomm AI Research: Munawar Hayat introduces his background in computer vision and his current focus on multimodal generative AI at Qualcomm.
- 10:25 The Root of VLM Hallucinations: An analysis of why Vision-Language Models discard visual information in favor of linguistic priors during inference.
- 18:55 Efficient Cross-Attention Architectures: A discussion on the computational complexity of injecting visual tokens via cross-attention modules versus simple concatenation.
- 31:55 Composed Multimodal Retrieval: Exploring how to handle complex queries that use both text and images as keys for searching large image galleries.
- 40:35 Addressing Identity Leakage in Human Generation: Introduction to the MultiHuman Testbench designed to measure and mitigate attribute blending in multi-person image generation.
- 52:45 Efficient Inference and Long Context: A look at KV cache eviction and speculative decoding techniques for optimizing LLM and VLM performance on-device.
- 56:45 Closing Remarks: Final thoughts on upcoming NeurIPS demos and the future of mobile-efficient AI.