Hugging Face Daily Papers · July 1, 2026 · 6 min read

Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


A feed-forward framework decomposes 3D scenes into instance-structured token groups from multi-view images, enabling direct object-level reconstruction, segmentation, and manipulation without 3D annotations.


arxiv:2606.29513


Published on Jun 28

Submitted by Mijin Yoo on Jul 1


Authors:

Mijin Yoo, In Cho, Subin Jeon, Jiwoo Lee, Eunbyung Park, Seon Joo Kim

Abstract

The one-sentence summary at the top of this page was generated by Qwen/Qwen2.5-Coder-32B-Instruct.

A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images -- compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing -- removing, translating, or inserting objects by operating on their groups -- as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.
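
To make the two-level factorization concrete, here is a minimal Python sketch of how a scene stored as token groups might be edited and queried. It is an illustration under stated assumptions, not the authors' implementation. The names `TokenGroup`, `translate_group`, `remove_group`, and `retrieve_instance` are hypothetical. Each Gaussian is reduced to its 3D center. The instance token is assumed to live in a space where a query embedding can be compared by cosine similarity; the abstract does not say how retrieval scores are computed.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple

Vec = Tuple[float, ...]


@dataclass(frozen=True)
class TokenGroup:
    """Hypothetical container for one object instance."""
    instance_token: Vec        # entity-level identity embedding (assumed comparable to queries)
    gaussian_means: List[Vec]  # centers of Gaussians decoded from the anchor tokens


def translate_group(group: TokenGroup, offset: Vec) -> TokenGroup:
    """Move one object by shifting every Gaussian center in its group."""
    moved = [tuple(m + o for m, o in zip(mean, offset)) for mean in group.gaussian_means]
    return TokenGroup(group.instance_token, moved)


def remove_group(scene: List[TokenGroup], index: int) -> List[TokenGroup]:
    """Delete one object by dropping its whole group from the scene."""
    return [g for i, g in enumerate(scene) if i != index]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_instance(scene: List[TokenGroup], query: Vec) -> int:
    """Return the index of the best-matching instance.

    One similarity is computed per group, so the cost grows with the
    number of instances, not with the number of Gaussians.
    """
    return max(range(len(scene)), key=lambda i: cosine(scene[i].instance_token, query))


# Toy scene with two objects and made-up 2D identity embeddings.
chair = TokenGroup((1.0, 0.0), [(0.0, 0.0, 0.0), (0.25, 0.0, 0.0)])
table = TokenGroup((0.0, 1.0), [(1.0, 0.0, 0.0)])
scene = [chair, table]

shifted_chair = translate_group(chair, (0.5, 0.0, 0.0))
scene_without_chair = remove_group(scene, 0)
best_match = retrieve_instance(scene, (0.1, 0.9))
```

In this sketch every edit touches exactly one group and leaves the rest of the scene untouched, which is what it means for instances to be the interface rather than something recovered from primitives afterwards. Inserting an object would correspond to appending a group to the list.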