Walking in the Implicit: Interactive World Exploration via Neural Scene Representation
Authors: Zhiqi Li, Chengrui Dong, Zhenhua Du, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Dongxu Wei, Peidong Liu
Published: 2026-06-29
Project page: https://lizhiqi49.github.io/NeuWorld
Code: https://github.com/WU-CVGL/NeuWorld
Abstract
NeuWorld enables efficient interactive video generation by representing scenes as compact, fixed-length neural implicit states: a transformer VAE learns the states from sparse posed frames, and a diffusion transformer evolves them conditioned on future camera trajectories before pose-conditioned rendering.
Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.
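The factorization described in the abstract can be illustrated with a toy rollout loop. This is only a minimal sketch under stated assumptions, not the NeuWorld implementation: `transition`, `render`, `explore`, the state size, and the pose encoding are all hypothetical stand-ins, and the geometry-aware history retrieval is omitted. The point it shows is structural: the rollout variable is a state of fixed size that is resampled stochastically for each trajectory chunk, while rendering is a deterministic function of the sampled state and a queried pose.

```python
import math
import random

NUM_TOKENS, DIM = 16, 8  # hypothetical fixed size of the scene state


def transition(state, trajectory, rng):
    """Stochastic stand-in for the diffusion transformer: sample the next state.

    `trajectory` is a list of camera poses, each a tuple of floats (hypothetical).
    """
    # Toy conditioning signal derived from the future camera trajectory.
    drift = math.tanh(sum(sum(pose) for pose in trajectory) / len(trajectory))
    # Gaussian noise makes the transition stochastic; the output keeps the input shape.
    return [[0.9 * x + 0.1 * drift + 0.1 * rng.gauss(0.0, 1.0) for x in token]
            for token in state]


def render(state, pose):
    """Deterministic stand-in for pose-conditioned rendering (one value per token)."""
    phase = sum(pose)
    return [math.cos(phase + i) * sum(token) / len(token)
            for i, token in enumerate(state)]


def explore(initial_state, trajectories, seed=0):
    """Roll out the scene state chunk by chunk and render every queried pose."""
    rng = random.Random(seed)
    state, frames = initial_state, []
    for trajectory in trajectories:
        state = transition(state, trajectory, rng)  # state size never grows
        frames.extend(render(state, pose) for pose in trajectory)
    return state, frames


initial_state = [[0.0] * DIM for _ in range(NUM_TOKENS)]
trajectories = [[(0.1 * k, 0.0, 0.0)] * 5 for k in range(3)]  # 3 chunks of 5 poses
final_state, frames = explore(initial_state, trajectories)
print(len(final_state), len(final_state[0]), len(frames))  # prints "16 8 15"
```

In this sketch the number of rendered frames grows with the trajectory length, but the state carried between steps stays at `NUM_TOKENS` by `DIM`, which is the contrast with frame-latent rollouts that the abstract draws.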
Community

Our system rolls out a fixed-length, renderable Neural Implicit Scene state and renders queried observations under camera control.