Hugging Face Daily Papers · 6 min read

Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.29513

Published on Jun 28 · Submitted by Mijin Yoo on Jul 1
Authors: Mijin Yoo, In Cho, Subin Jeon, Jiwoo Lee, Eunbyung Park, Seon Joo Kim

Abstract

A feed-forward framework decomposes 3D scenes into instance-structured token groups from multi-view images, enabling direct object-level reconstruction, segmentation, and manipulation without 3D annotations.

A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images -- compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing -- removing, translating, or inserting objects by operating on their groups -- as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.
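
To make the idea of operating on token groups more concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Gaussian`, `TokenGroup`, `remove_instance`, `translate_instance`, and `retrieve` are hypothetical, each Gaussian is reduced to its mean, the instance token is stood in for by a plain embedding vector, and the query embedding is assumed to come from some text encoder that is not shown. It only illustrates why editing and retrieval can work per instance rather than per primitive.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical stand-in for one decoded 3D Gaussian (mean only)."""
    mean: Vec3


@dataclass
class TokenGroup:
    """One object instance: an identity embedding plus its Gaussians."""
    instance_id: int
    instance_embedding: List[float]  # assumed stand-in for the instance token
    gaussians: List[Gaussian]        # assumed stand-in for decoded anchor tokens


def remove_instance(scene: List[TokenGroup], instance_id: int) -> List[TokenGroup]:
    """Delete an object by dropping its whole group."""
    return [g for g in scene if g.instance_id != instance_id]


def translate_instance(scene: List[TokenGroup], instance_id: int,
                       offset: Vec3) -> List[TokenGroup]:
    """Move an object by shifting every Gaussian in its group."""
    out = []
    for group in scene:
        if group.instance_id == instance_id:
            moved = [Gaussian((g.mean[0] + offset[0],
                               g.mean[1] + offset[1],
                               g.mean[2] + offset[2]))
                     for g in group.gaussians]
            group = TokenGroup(group.instance_id, group.instance_embedding, moved)
        out.append(group)
    return out


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(scene: List[TokenGroup], query: List[float]) -> int:
    """Return the best-matching instance: one comparison per group."""
    best = max(scene, key=lambda g: cosine(g.instance_embedding, query))
    return best.instance_id


# Toy scene with two instances and made-up 2D embeddings.
scene = [
    TokenGroup(0, [1.0, 0.0], [Gaussian((0.0, 0.0, 0.0)), Gaussian((0.1, 0.0, 0.0))]),
    TokenGroup(1, [0.0, 1.0], [Gaussian((1.0, 1.0, 1.0))]),
]
moved = translate_instance(scene, 1, (0.5, 0.0, 0.0))
```

In this sketch `retrieve` performs one similarity computation per token group, so its cost grows with the number of instances and is independent of how many Gaussians each instance holds, which mirrors the scaling claim in the abstract.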


