# MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

- Page: https://stenobird.com/podcast/daily-paper-cast-7079649/memlens-benchmarking-multimodal-long-term-memory-in-large-vision-language-models
- Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/memlens-benchmarking-multimodal-long-term-memory-in-large-vision-language-models.md
- Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
- Published: 2026-05-16T04:25:28+00:00
- Episode link: https://share.transistor.fm/s/25b78099
- Audio file: https://media.transistor.fm/25b78099/830bc2f3.mp3
- Processing state: not_requested
- JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/memlens-benchmarking-multimodal-long-term-memory-in-large-vision-language-models
- Duration seconds: 1617

## Resource

🤗 Upvotes: 62 | cs.CV

Authors: Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See

Title: MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Arxiv: http://arxiv.org/abs/2605.14906v1

Abstract: Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme.
An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github…

## Actions

- request_transcript (`POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/memlens-benchmarking-multimodal-long-term-memory-in-large-vision-language-models/transcription-requests`): Idempotently request low-priority transcript generation for this episode.
- read_markdown (`GET https://stenobird.com/podcast/daily-paper-cast-7079649/memlens-benchmarking-multimodal-long-term-memory-in-large-vision-language-models.md`): Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.