# MMSkills: Towards Multimodal Skills for General Visual Agents

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/mmskills-towards-multimodal-skills-for-general-visual-agents
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/mmskills-towards-multimodal-skills-for-general-visual-agents.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-19T04:21:16+00:00
Episode link: https://share.transistor.fm/s/457d4957
Audio file: https://media.transistor.fm/457d4957/efcd8362.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/mmskills-towards-multimodal-skills-for-general-visual-agents
Duration seconds: 1360

## Resource

🤗 Upvotes: 104 | cs.AI Authors: Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu Title: MMSkills: Towards Multimodal Skills for General Visual Agents Arxiv: http://arxiv.org/abs/2605.13527v2 Abstract: Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporar…

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/mmskills-towards-multimodal-skills-for-general-visual-agents/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/mmskills-towards-multimodal-skills-for-general-visual-agents.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.