Hugging Face Daily Papers · 5 min read

The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/NTUYWANG103/ViDiHand
Organization: Nanyang Technological University Singapore
arxiv:2606.30308

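Published on Jun 29, 2026 · Submitted by Yufei Liu on Jun 30
Authors: Yuxi Wang, Chengkai Jin, Yufei Liu, Wenqi Ouyang, Tianyi Wei, Zhiwei Zeng, Siyuan Huang, Zhiqi Shen, Xingang Pan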

Abstract

ViDiHand adapts the representations of a pretrained video diffusion model with a hand-overlay rendering objective, then decodes 4D two-hand pose directly from full video frames, with no detector and no test-time optimization.

4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames: no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
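The abstract names a hand-overlay rendering objective but does not say how the overlay target is built or which loss is used. As an illustration only, the sketch below shows one plausible reading: alpha-composite a rendered hand over the input frame to form a target, then score a prediction against it with mean squared error. The function names `overlay_hand` and `mse`, the toy 2x2 grayscale images, and the choice of MSE are all assumptions made for this example, not details from the paper.

```python
from typing import List

# Toy grayscale image: rows of pixel intensities in [0, 1].
Image = List[List[float]]


def overlay_hand(frame: Image, hand_render: Image, alpha: Image) -> Image:
    """Alpha-composite a rendered hand over a frame, pixel by pixel.

    alpha is 1.0 where the hand fully covers the pixel and 0.0 where
    the original frame should show through (hypothetical convention).
    """
    return [
        [a * h + (1.0 - a) * f for f, h, a in zip(f_row, h_row, a_row)]
        for f_row, h_row, a_row in zip(frame, hand_render, alpha)
    ]


def mse(pred: Image, target: Image) -> float:
    """Mean squared error between two images of the same size."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical inputs: a flat grey frame, a white hand render, and a
# coverage mask with one fully covered and one half covered pixel.
frame = [[0.2, 0.2], [0.2, 0.2]]
hand_render = [[1.0, 1.0], [1.0, 1.0]]
alpha = [[1.0, 0.0], [0.5, 0.0]]

target = overlay_hand(frame, hand_render, alpha)
loss_if_unchanged = mse(frame, target)  # error of predicting the raw frame
```

In this toy setup, pixels outside the hand mask are identical in the frame and the target, so the error comes only from the hand region. That is one way to picture an objective that specializes features for hands while leaving the rest of the scene untouched.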

Get this paper in your agent:

hf papers read 2606.30308
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
