Episode
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
- Podcast
- Daily Paper Cast
- Published
- May 15, 2026
- Duration seconds
- 1406
- Processing state
- not_requested
- Canonical source
- https://share.transistor.fm/s/7a4f5ed0
Actions
POST https://stenobird.com/v1/public/podcasts/daily-paper-cast-7079649/episodes/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/daily-paper-cast-7079649/trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking.md
Read the agent-friendly Markdown representation of this episode resource.
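The two actions above can be sketched as a small Python helper. This is only an illustration built from the URLs listed on this page: the helper names `transcription_request` and `markdown_url` are hypothetical, and whether the POST needs a body, headers, or authentication is not stated here, so the sketch only constructs the request and does not send it.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
# Slugs copied from the URLs listed under Actions.
PODCAST = "daily-paper-cast-7079649"
EPISODE = "trackcraft3r-repurposing-video-diffusion-transformers-for-dense-3d-tracking"


def transcription_request(podcast: str, episode: str) -> Request:
    """Build (but do not send) the POST that requests a transcript.

    Hypothetical helper: body and auth requirements are unknown.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{podcast}"
        f"/episodes/{episode}/transcription-requests"
    )
    return Request(url, method="POST")


def markdown_url(podcast: str, episode: str) -> str:
    """Return the URL of the Markdown representation of an episode."""
    return f"{BASE}/podcast/{podcast}/{episode}.md"


req = transcription_request(PODCAST, EPISODE)
md = markdown_url(PODCAST, EPISODE)
```

Because the action is described as idempotent, repeating the same POST should be safe for a client that retries, though the exact response codes are not documented on this page.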
Summary
🤗 Upvotes: 31 | cs.CV
Authors: Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim
Title: TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
Arxiv: http://arxiv.org/abs/2605.12587v1
Abstract: Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame g…
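The distinction the abstract draws between a frame-anchored reconstruction pointmap and a reference-anchored tracking pointmap can be illustrated with a toy data-structure sketch. This is not the paper's implementation: the sizes, the single moving point, and the names `recon`, `track`, `visible`, and `point_at` are all assumptions chosen for illustration, and the image is reduced to one row of pixels.

```python
T, W = 3, 4   # toy sizes (assumed): 3 frames, one image row of 4 pixels
DEPTH = 2.0   # assumed constant depth of the toy point


def point_at(t):
    """Toy motion: at time t the point projects to pixel column t."""
    return t, (float(t), 0.0, DEPTH)


# Frame-anchored pointmap: recon[t][u] is the 3D point observed at
# pixel u of frame t. None marks pixels with no point in this toy.
recon = [[None] * W for _ in range(T)]
for t in range(T):
    col, xyz = point_at(t)
    recon[t][col] = xyz

# Reference-anchored tracking pointmap: track[t][u] is the 3D position
# at time t of the point that was seen at pixel u of the first frame.
track = [[None] * W for _ in range(T)]
visible = [[False] * W for _ in range(T)]
ref_col, _ = point_at(0)  # pixel of the point in the reference frame
for t in range(T):
    _, xyz = point_at(t)
    track[t][ref_col] = xyz      # always indexed by the reference pixel
    visible[t][ref_col] = True   # the toy point is never occluded
```

In `recon` the moving point shows up at a different pixel index in every frame, whereas in `track` it stays at the index of its first-frame pixel and only its 3D coordinates change, which is the sense in which the tracking pointmap "follows every pixel of the first frame across time".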