# TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

Page: https://stenobird.com/podcast/daily-paper-cast-7079649/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking
Text version: https://stenobird.com/podcast/daily-paper-cast-7079649/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking.md
Podcast: [Daily Paper Cast](https://stenobird.com/podcast/daily-paper-cast-7079649)
Published: 2026-05-15T04:59:50+00:00
Episode link: https://share.transistor.fm/s/7a4f5ed0
Audio file: https://media.transistor.fm/7a4f5ed0/7821fe9e.mp3
Processing state: not_requested
JSON: https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking
Duration seconds: 1406

## Resource

🤗 Upvotes: 31 | cs.CV
Authors: Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim
Title: TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
Arxiv: http://arxiv.org/abs/2605.12587v1

Abstract: Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame g…
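
The abstract's distinction between a frame-anchored reconstruction pointmap and a reference-anchored tracking pointmap can be illustrated with a toy scene. The sketch below is not the paper's method: TrackCraft3R predicts the tracking pointmap from video, whereas this example computes both maps from a known, made-up motion. The 1x3 image, the constant depth, the one-unit-per-frame shift, and every function name are assumptions chosen for illustration, and visibility here only checks whether a point is still inside the image (occlusion is ignored).

```python
from typing import List, Tuple

Point = Tuple[float, float, float]

# Hypothetical toy setup: a 1x3 "image" whose pixel i looks at x = i + 0.5,
# with every physical point at the same depth and sliding right over time.
WIDTH = 3
DEPTH = 5.0
SPEED = 1.0  # units of x per frame (assumed motion)


def position(x0: float, t: int) -> Point:
    """3D position at frame t of the physical point that started at x = x0."""
    return (x0 + SPEED * t, 0.0, DEPTH)


def frame_anchored(num_frames: int) -> List[List[Point]]:
    """maps[t][i]: the 3D point observed at pixel i of frame t."""
    maps = []
    for t in range(num_frames):
        # The point currently under pixel i started at x0 = i + 0.5 - SPEED * t.
        maps.append([position(i + 0.5 - SPEED * t, t) for i in range(WIDTH)])
    return maps


def reference_anchored(num_frames: int) -> Tuple[List[List[Point]], List[List[bool]]]:
    """tracks[t][i]: where the point seen at pixel i of frame 0 is at frame t."""
    tracks, visible = [], []
    for t in range(num_frames):
        row = [position(i + 0.5, t) for i in range(WIDTH)]
        tracks.append(row)
        # Toy visibility: the point is still inside the image bounds.
        visible.append([0.0 <= p[0] < WIDTH for p in row])
    return tracks, visible
```

In this toy scene the frame-anchored maps are identical in every frame even though every point is moving, which is why per-frame geometry alone does not reveal motion. The reference-anchored map keeps following the first frame's pixels, including after they leave the image.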

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/daily-paper-cast-7079649/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
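
The two actions above could be invoked from an agent roughly as follows. This is a minimal sketch using only the Python standard library: the URLs and HTTP methods come from the action list, while the empty POST body, the absence of authentication headers, and the helper names are assumptions, since this page does not document a payload or response format.

```python
from typing import Tuple
from urllib.request import Request, urlopen

BASE = "https://stenobird.com"
PODCAST = "daily-paper-cast-7079649"
EPISODE = (
    "trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking"
)


def build_request_transcript() -> Request:
    """Build the request_transcript action (POST)."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: an empty body is sent; the page documents no payload.
    return Request(url, data=b"", method="POST")


def build_read_markdown() -> Request:
    """Build the read_markdown action (GET)."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


def send(request: Request, timeout: float = 10.0) -> Tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `send(build_request_transcript())` would submit the transcription request, and `send(build_read_markdown())` would fetch the Markdown representation. Because the page describes `request_transcript` as idempotent, repeating the POST should be safe.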

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.