# Why Vision Language Models Ignore What They See with Munawar Hayat - #758

Page: https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758
Text version: https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758.md
Podcast: [The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)](https://stenobird.com/podcast/twiml-ai-podcast)
Published: 2025-12-09T19:46:00+00:00
Episode link: https://twimlai.com/podcast/twimlai/why-vision-language-models-ignore-what-they-see/
Audio file: https://pscrb.fm/rss/p/traffic.megaphone.fm/MLN7251543598.mp3?updated=1765310086
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758
Duration seconds: 3460

## Resource

Vision-Language Models (VLMs) often suffer from hallucinations because they rely on language priors rather than actual visual input. This episode explores new research from Qualcomm AI Research on enforcing visual grounding and improving multimodal retrieval.

## Highlights
- Main idea: VLMs frequently ignore visual tokens, instead relying on the language model's pre-trained parametric memory to answer questions
- Failure mode: Models struggle with 'counting' and 'iterative reasoning' because they process images and text in a single end-to-end step without intermediate reflection
- Practical takeaway: New attention-guided alignment techniques can force models to ground their responses in the actual visual features provided
- Main idea: Generalized Contrastive Learning (GCL) enables complex retrieval tasks, such as searching for images using a combination of both text and image queries
- Technical challenge: Generating multiple human subjects in generative models often leads to identity leakage and attribute blending between individuals
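
The first highlight, that VLMs under-use visual tokens, can be made concrete with a simple diagnostic: the fraction of a generated token's attention mass that lands on image-token positions. The sketch below is illustrative only. The function name `visual_attention_share` and the toy attention values are assumptions made up for this example; they are not taken from the episode and do not represent the method discussed by the guest.

```python
def visual_attention_share(attn_row, visual_positions):
    """Return the fraction of one query token's attention mass on image tokens.

    attn_row: attention weights from one generated token over all context tokens.
    visual_positions: indices of the context tokens that came from the image.
    """
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive mass")
    visual = sum(attn_row[i] for i in visual_positions)
    return visual / total


# Toy attention row over 6 context tokens (hypothetical values):
# positions 0-3 are image tokens, positions 4-5 are text tokens.
attn_row = [0.02, 0.03, 0.01, 0.04, 0.50, 0.40]
share = visual_attention_share(attn_row, range(4))
print(f"share of attention on image tokens: {share:.2f}")
```

A low share on questions that can only be answered from the image would be one sign that the model is leaning on language priors rather than on what it sees.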

## Topics

Vision-Language Models, Multimodal AI, Hallucination Mitigation, Contrastive Learning, Generative AI, Qualcomm AI Research, NeurIPS, On-device AI

## Chapters
- 1:00 — Introduction to Qualcomm AI Research: Munawar Hayat introduces his background in computer vision and his current focus on multimodal generative AI at Qualcomm.
- 10:25 — The Root of VLM Hallucinations: An analysis of why Vision-Language Models discard visual information in favor of linguistic priors during inference.
- 18:55 — Efficient Cross-Attention Architectures: A discussion of the computational complexity of injecting visual tokens through cross-attention modules compared with simply concatenating them with the text tokens.
- 31:55 — Composed Multimodal Retrieval: Exploring how to handle complex queries that combine text and an image when searching large image galleries.
- 40:35 — Addressing Identity Leakage in Human Generation: Introduction to the MultiHuman Testbench designed to measure and mitigate attribute blending in multi-person image generation.
- 52:45 — Efficient Inference and Long Context: A look at KV cache eviction and speculative decoding techniques for optimizing LLM and VLM performance on-device.
- 56:45 — Closing Remarks: Final thoughts on upcoming NeurIPS demos and the future of mobile-efficient AI.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/twiml-ai-podcast/episodes/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/twiml-ai-podcast/why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
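
A minimal sketch of how an agent might prepare the two actions above with Python's standard library. The URLs and HTTP methods come from the Actions list. Sending an empty POST body with no authentication headers is an assumption, since this page does not document the request format. The helper names `build_transcript_request` and `build_markdown_request` are hypothetical.

```python
import urllib.request

BASE = "https://stenobird.com"
SLUG = "why-vision-language-models-ignore-what-they-see-with-munawar-hayat-758"


def build_transcript_request():
    """Prepare the request_transcript action (POST). Nothing is sent here."""
    url = (
        f"{BASE}/v1/public/podcasts/twiml-ai-podcast/episodes/"
        f"{SLUG}/transcription-requests"
    )
    # Assumption: an empty body is accepted; the page does not specify a payload.
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request():
    """Prepare the read_markdown action (GET). Nothing is sent here."""
    url = f"{BASE}/podcast/twiml-ai-podcast/{SLUG}.md"
    return urllib.request.Request(url, method="GET")


transcript_req = build_transcript_request()
markdown_req = build_markdown_request()
print(transcript_req.get_method(), transcript_req.full_url)
print(markdown_req.get_method(), markdown_req.full_url)

# To actually send a request, pass it to urllib.request.urlopen, for example:
#     with urllib.request.urlopen(transcript_req) as resp:
#         print(resp.status)
```

Because the action is described as idempotent, repeating the POST for the same episode should be safe, though the response format is not documented here.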

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.