Hugging Face Daily Papers · 4 min read

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Github: https://github.com/amap-cvlab/ABot-Manipulation
Project page: https://amap-cvlab.github.io/ABot-Manipulation/

arxiv:2607.00678

Published on Jul 1 · Submitted by taesiri on Jul 2
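Authors: Ronghan Chen, Yandan Yang, Zuojin Tang, Dongjie Huo, Tong Lin, Haoning Wu, Haoyun Liu, Yuzhi Chen, Lulu Zheng, Botai Yuan, Tianlun Li, Mingxin Wang, Dekang Qi, Bin Hu, Wei Mei, Yuze Xuan, Haolong Yang, Yanqing Zhu, Mu Xu, Zhiheng Ma, Xinyuan Chang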

Abstract

ABot-M0.5 is a World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency in autoregressive prediction.

Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.
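
The abstract says dream-forcing "progressively trains inverse dynamics on model-predicted videos" but does not give the schedule. The sketch below is one possible reading, not the authors' implementation: it assumes a linear ramp that gradually swaps ground-truth video latents for model-predicted ones as the input to the inverse dynamics model. The function names `dream_ratio` and `pick_idm_input`, the parameters `warmup_steps`, `ramp_steps`, and `max_ratio`, and the per-sample random choice are all hypothetical.

```python
import random


def dream_ratio(step: int, warmup_steps: int, ramp_steps: int,
                max_ratio: float = 1.0) -> float:
    """Fraction of samples that should use model-predicted video.

    Assumed schedule (not from the paper): 0 during warmup, then a
    linear ramp up to max_ratio over ramp_steps training steps.
    """
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / max(ramp_steps, 1)
    return max_ratio * min(progress, 1.0)


def pick_idm_input(real_latents, predicted_latents, ratio: float,
                   rng: random.Random):
    """Choose the inverse-dynamics input for one training sample.

    Returns the chosen latents and a flag that is True when the
    model-predicted ("dreamed") latents were selected.
    """
    use_predicted = rng.random() < ratio
    chosen = predicted_latents if use_predicted else real_latents
    return chosen, use_predicted


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 100, 300, 500, 1000):
        ratio = dream_ratio(step, warmup_steps=100, ramp_steps=400)
        _, dreamed = pick_idm_input("real", "predicted", ratio, rng)
        print(f"step={step} ratio={ratio:.2f} dreamed={dreamed}")
```

The point of a ramp like this is that the inverse dynamics model first learns from clean video and is only later exposed to the imperfect frames it will actually see during autoregressive rollout, which is the train-test mismatch the abstract describes.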
