SAM2Matting: Generalized Image and Video Matting
Authors: Ruiqi Shen, Guangquan Jie, Chang Liu, Henghui Ding (FudanCVL)
Published: 2026-06-25 (arXiv:2606.27339)
Project page: https://henghuiding.com/SAM2Matting/
Code: https://github.com/FudanCVL/SAM2Matting
Abstract
SAM2Matting advances video matting by decoupling tracking and matting tasks through a tracker-to-matting framework that leverages foundational trackers with region-proposal bridges and dedicated matting heads.
Despite impressive advances in image matting, video matting remains challenging due to the inherent gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods attempt to bridge this gap with expensive and narrowly scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink the paradigm with SAM2Matting, a tracker-to-matting framework that extends video object segmentation (VOS) trackers to high-fidelity video matting. Specifically, it decouples the task by enhancing a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, enabling the uncompromised tracker to handle temporal consistency while the matting components resolve fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art performance on video matting, supports diverse prompt types, maintains strong temporal consistency, and demonstrates robust generalization across both human-centric and in-the-wild scenarios.
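To make the decoupling concrete, here is a toy sketch of how a tracker, a region proposal step, and a matting step could be chained. It is not the authors' implementation: this page does not describe the actual bridge or heads, so `propose_region`, `matting_head`, `matte_video`, and `toy_tracker` are hypothetical names, the boundary-band heuristic is an assumption, and using pixel intensity as alpha is only a placeholder for a learned head.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]  # grayscale image, values in [0, 1]
Mask = List[List[int]]     # binary tracker mask, 0 or 1
Alpha = List[List[float]]  # alpha matte, values in [0, 1]


def propose_region(mask: Mask, radius: int = 1) -> Mask:
    """Hypothetical bridge: flag pixels near a mask boundary as uncertain.

    A pixel is flagged when its (2*radius+1)-wide neighbourhood contains
    both foreground and background labels. Out-of-bounds neighbours are
    ignored.
    """
    h, w = len(mask), len(mask[0])
    region = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            labels = set()
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    labels.add(mask[ny][nx])
            region[y][x] = 1 if len(labels) == 2 else 0
    return region


def matting_head(frame: Frame, mask: Mask, region: Mask) -> Alpha:
    """Placeholder head: keep the tracker label outside the region and use
    clamped pixel intensity as a stand-in for soft alpha inside it."""
    return [
        [
            min(1.0, max(0.0, frame[y][x])) if region[y][x] else float(mask[y][x])
            for x in range(len(mask[0]))
        ]
        for y in range(len(mask))
    ]


def matte_video(
    frames: Sequence[Frame],
    track: Callable[[Sequence[Frame]], List[Mask]],
    radius: int = 1,
) -> List[Alpha]:
    """Run the tracker once over the clip, then refine each frame on its own."""
    masks = track(frames)  # temporal consistency is left entirely to the tracker
    return [
        matting_head(frame, mask, propose_region(mask, radius))
        for frame, mask in zip(frames, masks)
    ]


def toy_tracker(frames: Sequence[Frame]) -> List[Mask]:
    """Stand-in for a real VOS tracker: threshold each frame at 0.5."""
    return [[[1 if v >= 0.5 else 0 for v in row] for row in f] for f in frames]


if __name__ == "__main__":
    clip = [[[0.1, 0.2, 0.4, 0.9, 0.8]]]  # one frame, 1 x 5 pixels
    print(matte_video(clip, toy_tracker))  # [[[0.0, 0.0, 0.4, 0.9, 1.0]]]
```

In this sketch the refinement step never touches pixels outside the proposed band, so the tracker's decisions stay intact there; that is one simple way to read "uncompromised tracker" in the abstract, under the assumptions stated above.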