Hugging Face Daily Papers · July 2, 2026 · 5 min read

MemLearner: Learning to Query Context memory for Video World Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.31734

Published on Jun 30 · Submitted by taesiri on Jul 1

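Authors: Jiwen Yu, Jianxiong Gao, Jianhong Bai, Yiran Qin, Kaiyi Huang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Xihui Liu

Project page: https://yujiwen.github.io/memlearner/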

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): MemLearner improves video world models by using learning-based adaptive context querying with query tokens to enhance scene consistency and memory in long video sequences with occlusions and dynamic objects.

Video world models are interactive video generation models that predict future world states from user actions and past video frames. A critical challenge for these models is the lack of memory, which makes generated scenes inconsistent over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize to scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method that uses query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy that leverages both annotated rendered videos and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.
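To make the idea of query tokens reading from context memory more concrete, here is a minimal sketch, assuming a plain single-head cross-attention in which a small set of query tokens attends over stored context tokens. The abstract does not describe the architecture at this level of detail, so the function names (`query_context`, `make_demo`), the token shapes, and the random vectors standing in for learned embeddings are all illustrative assumptions and not MemLearner's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_context(query_tokens, context_tokens):
    """Single-head cross-attention (hypothetical stand-in for context querying).

    Each query token scores every context token, and the scores are turned
    into weights that sum to 1. Returns (summaries, weights), where
    summaries[i] is the weighted average of the context tokens for query i.
    """
    dim = len(context_tokens[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention
    summaries, weights = [], []
    for q in query_tokens:
        w = softmax([dot(q, c) * scale for c in context_tokens])
        summary = [
            sum(w_j * c[k] for w_j, c in zip(w, context_tokens))
            for k in range(dim)
        ]
        summaries.append(summary)
        weights.append(w)
    return summaries, weights


def make_demo(num_context=6, num_queries=2, dim=4, seed=0):
    """Random vectors standing in for learned query and context embeddings."""
    rng = random.Random(seed)
    context = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_context)]
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
    return queries, context


if __name__ == "__main__":
    queries, context = make_demo()
    summaries, weights = query_context(queries, context)
    for i, w in enumerate(weights):
        print(f"query {i}: weights = {[round(x, 3) for x in w]}")
```

The contrast with rule-based retrieval is that nothing here hard-codes which context frames to fetch: the weights depend entirely on the query vectors, which in a trained model would be learned rather than sampled at random as they are in this toy.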