Episode

Why Vision Language Models Ignore What They See with Munawar Hayat - #758

Podcast: The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Published: Dec 9, 2025
Duration seconds: 3460
Processing state: processed
Canonical source: https://twimlai.com/podcast/twimlai/why-vision-language-models-ignore-what-they-see/
Audio: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7251543598.mp3?updated=1765310086
JSON: /v1/public/podcasts/twiml-ai-podcast/episodes/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758
Markdown: /podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758.md

Actions

POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758.md
Read the agent-friendly Markdown representation of this episode resource.
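
The two actions above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: the helper names (`transcription_request`, `markdown_request`, `send`) are hypothetical, and because this page does not document a request body, authentication, or response format, the POST is assumed to need no body and the response is returned as plain text.

```python
import urllib.request

BASE = "https://stenobird.com"
SLUG = "why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758"


def transcription_request(slug: str = SLUG) -> urllib.request.Request:
    """Build (but do not send) the POST that requests transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/twiml-ai-podcast/episodes/"
        f"{slug}/transcription-requests"
    )
    # Assumption: no request body or auth header is required.
    return urllib.request.Request(url, method="POST")


def markdown_request(slug: str = SLUG) -> urllib.request.Request:
    """Build (but do not send) the GET for the Markdown representation."""
    url = f"{BASE}/podcast/twiml-ai-podcast/{slug}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body as text."""
    # Assumption: the response is UTF-8 text; the actual format is not documented here.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Because the POST is described as idempotent, calling `send(transcription_request())` more than once should be safe; `send(markdown_request())` would then fetch the Markdown text.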

Summary

Vision-Language Models (VLMs) often suffer from hallucinations because they rely on language priors rather than actual visual input. This episode explores new research from Qualcomm AI Research on enforcing visual grounding and improving multimodal retrieval.

Topics

Vision-Language Models
Multimodal AI
Hallucination Mitigation
Contrastive Learning
Generative AI
Qualcomm AI Research
NeurIPS
On-device AI

Highlights

Main idea: VLMs frequently ignore visual tokens, instead relying on the language model's pre-trained parametric memory to answer questions
Failure mode: Models struggle with 'counting' and 'iterative reasoning' because they process images and text in a single end-to-end step without intermediate reflection
Practical takeaway: New attention-guided alignment techniques can force models to ground their responses in the actual visual features provided
Main idea: Generalized Contrastive Learning (GCL) enables complex retrieval tasks, such as searching for images with a query that combines text and an image
Technical challenge: Generating multiple human subjects in generative models often leads to identity leakage and attribute blending between individuals
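
To make the composed-retrieval highlight concrete, here is a toy sketch of searching a gallery with a combined text and image query. It is not the GCL method discussed in the episode: the fusion rule (a weighted average of normalized embeddings), the function names, and the hand-written 3-dimensional embeddings are all assumptions chosen for illustration, and a real system would obtain embeddings from a trained encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def compose_query(text_emb, image_emb, text_weight=0.5):
    """Fuse a text and an image embedding into one unit-length query.

    Assumption: a weighted average of the two normalized embeddings,
    which is only one of many possible fusion rules.
    """
    if len(text_emb) != len(image_emb):
        raise ValueError("embeddings must share one dimensionality")
    t, i = normalize(text_emb), normalize(image_emb)
    fused = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, i)]
    return normalize(fused)


def retrieve(query, gallery, top_k=3):
    """Return the names of the top_k gallery items by cosine similarity."""
    scored = []
    for name, emb in gallery.items():
        g = normalize(emb)
        scored.append((sum(q * x for q, x in zip(query, g)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy embeddings (hypothetical): axis 0 ~ "red", axis 1 ~ "dress", axis 2 ~ other.
gallery = {
    "red_dress": [1.0, 1.0, 0.0],
    "red_car": [1.0, 0.0, 1.0],
    "blue_dress": [0.0, 1.0, 1.0],
}
query = compose_query(text_emb=[1.0, 0.0, 0.0], image_emb=[0.0, 1.0, 0.0])
best = retrieve(query, gallery, top_k=1)
```

In this toy setup the text part pulls the query toward "red" and the image part toward "dress", so the item matching both ranks first.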

Chapters

1:00 Introduction to Qualcomm AI Research: Munawar Hayat introduces his background in computer vision and his current focus on multimodal generative AI at Qualcomm.
10:25 The Root of VLM Hallucinations: An analysis of why Vision-Language Models discard visual information in favor of linguistic priors during inference.
18:55 Efficient Cross-Attention Architectures: A discussion on the computational complexity of injecting visual tokens via cross-attention modules versus simple concatenation.
31:55 Composed Multimodal Retrieval: Exploring how to handle complex queries that combine text and images when searching large image galleries.
40:35 Addressing Identity Leakage in Human Generation: Introduction to the MultiHuman Testbench designed to measure and mitigate attribute blending in multi-person image generation.
52:45 Efficient Inference and Long Context: A look at KV cache eviction and speculative decoding techniques for optimizing LLM and VLM performance on-device.
56:45 Closing Remarks: Final thoughts on upcoming NeurIPS demos and the future of mobile-efficient AI.