Hugging Face Daily Papers · 3 min read

Walking in the Implicit: Interactive World Exploration via Neural Scene Representation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://lizhiqi49.github.io/NeuWorld
Code: https://github.com/WU-CVGL/NeuWorld
arxiv:2606.30045

Published on Jun 29, 2026 · Submitted by Zhiqi Li on Jun 30, 2026
Authors: Zhiqi Li, Chengrui Dong, Zhenhua Du, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Dongxu Wei, Peidong Liu

Abstract

NeuWorld enables efficient interactive video generation by representing a scene as a compact neural implicit state: a transformer VAE learns the state, a diffusion transformer evolves it along the camera trajectory, and observations are rendered deterministically from the queried pose.

Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.
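The factorization described in the abstract (a stochastic transition over a fixed-length state, followed by deterministic pose-conditioned rendering) can be illustrated with a toy sketch. This is not NeuWorld's code: `STATE_LEN`, `transition`, `render`, and `rollout` are hypothetical names, the state is a plain list of floats, and simple arithmetic stands in for the diffusion transformer and the renderer. The only point is the control flow: the rollout variable stays the same size no matter how many frames are queried.

```python
import math
import random

STATE_LEN = 8  # hypothetical fixed size of the scene state (assumption)


def transition(state, trajectory, rng):
    """Stochastic state update (toy stand-in for a learned transition model)."""
    # Summarize the upcoming camera poses into one scalar "drift" term.
    drift = sum(sum(pose) for pose in trajectory) / len(trajectory)
    # Noise makes the transition stochastic; the state length never changes.
    return [0.9 * s + 0.1 * drift + rng.gauss(0.0, 0.05) for s in state]


def render(state, pose):
    """Deterministic pose-conditioned readout (toy stand-in for a renderer)."""
    x, y, yaw = pose
    return sum(s * math.cos(i * yaw + x - y) for i, s in enumerate(state))


def rollout(initial_state, trajectories, seed=0):
    """Alternate stochastic transitions with deterministic rendering."""
    rng = random.Random(seed)
    state = list(initial_state)
    frames = []
    for trajectory in trajectories:
        state = transition(state, trajectory, rng)  # sample the next scene state
        frames.extend(render(state, pose) for pose in trajectory)  # query views
    return state, frames


# Five trajectory chunks of four poses each, given as (x, y, yaw) tuples.
trajectories = [[(0.0, 0.0, 0.1 * k) for k in range(4)] for _ in range(5)]
state, frames = rollout([0.0] * STATE_LEN, trajectories)
print(len(state), len(frames))  # prints: 8 20
```

In this sketch the number of rendered frames grows with the trajectory while the state stays at `STATE_LEN` entries, which is the contrast the abstract draws with rolling out a growing sequence of frame latents.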

Community

Paper submitter about 21 hours ago

[Figure: teaser image]

Our system rolls out a fixed-length, renderable Neural Implicit Scene state and renders queried observations under camera control.

Get this paper in your agent:

hf papers read 2606.30045
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
