Hugging Face Daily Papers · July 1, 2026 · 5 min read

Xiaomi-GUI-0 Technical Report

#model-release #multimodal #agents #benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.31410

Published on Jun 30, 2026 · Submitted by taesiri on Jul 1, 2026 · Xiaomi Research

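Authors: Wanxia Cao, Chengzhen Duan, Pei Fu, Pengzhi Gao, Niu Lian, Fazhan Liu, Hui Liu, Heng Qu, Qinzhuo Wu, Zhehao Yu, Tongbo Chen, Shiqi Cui, Anan Du, Shukai Jia, Yuanfa Li, Yike Liu, Wenchao Lu, Haoyuan Sun, Jiatong Sun, Cheng Tan, Yajie Wang, Changqiao Wu, Tao Xiong, Jiahui Yang, Yuxuan Yuan, Ruoceng Zhang, Shaojie Zhang, Jian Zhu, Jian Luan, Cong Zou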

Abstract

A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.
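The abstract names three outputs of the error-driven data flywheel (corrected actions, reflective explanations, and recovery demonstrations) but does not describe its data format. The following is a minimal sketch of how one failed trajectory could be expanded into those three sample types. Every name here (`Step`, `Trajectory`, `TrainingSample`, `flywheel`, and the `correct` callback standing in for whatever supplies the right action) is a hypothetical illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One agent step: the screen it saw and the action it took (assumed schema)."""
    observation: str
    action: str


@dataclass
class Trajectory:
    """A recorded task attempt; failed_step marks the first wrong step, if known."""
    task: str
    steps: List[Step]
    success: bool
    failed_step: Optional[int] = None


@dataclass
class TrainingSample:
    """A training example derived from a failure (assumed schema)."""
    kind: str            # "corrected_action", "reflection", or "recovery"
    task: str
    history: List[Step]  # steps that precede the target
    observation: str     # screen the target responds to
    target: str          # action or explanation the model should produce


def flywheel(
    trajectories: List[Trajectory],
    correct: Callable[[str, str], str],
) -> List[TrainingSample]:
    """Expand each failed trajectory into up to three training samples.

    `correct(task, observation)` is a hypothetical oracle (for example a human
    annotator or a stronger model) that returns the right action for a screen.
    """
    samples: List[TrainingSample] = []
    for traj in trajectories:
        # Successful runs and failures without a located error are skipped.
        if traj.success or traj.failed_step is None:
            continue
        i = traj.failed_step
        bad = traj.steps[i]
        fixed = correct(traj.task, bad.observation)

        # 1. Corrected action: same screen, right action as the target.
        samples.append(TrainingSample(
            "corrected_action", traj.task, traj.steps[:i], bad.observation, fixed))

        # 2. Reflective explanation: the history includes the mistake itself.
        explanation = (
            f"Action {bad.action!r} did not fit the screen "
            f"{bad.observation!r}; expected {fixed!r}."
        )
        samples.append(TrainingSample(
            "reflection", traj.task, traj.steps[: i + 1], bad.observation, explanation))

        # 3. Recovery demonstration: act correctly from the screen that
        #    followed the mistake, if the trajectory recorded one.
        if i + 1 < len(traj.steps):
            after = traj.steps[i + 1].observation
            samples.append(TrainingSample(
                "recovery", traj.task, traj.steps[: i + 1], after,
                correct(traj.task, after)))
    return samples
```

In this sketch a failure whose wrong step is the last recorded one yields only the first two sample types, since there is no later screen to recover from.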

arXiv: arxiv.org/abs/2606.31410 · Project page: https://seerray-lab.github.io/Xiaomi-GUI-0/

Community

librarian-bot · Jul 2, 2026

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

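The following papers were recommended by the Semantic Scholar API:

- MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization (2026), https://huggingface.co/papers/2606.19930
- ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis (2026), https://huggingface.co/papers/2605.25160
- GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models (2026), https://huggingface.co/papers/2606.12817
- ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents (2026), https://huggingface.co/papers/2606.22948
- How Mobile World Model Guides GUI Agents? (2026), https://huggingface.co/papers/2605.10347
- MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration (2026), https://huggingface.co/papers/2605.26546
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents (2026), https://huggingface.co/papers/2605.12481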

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Models citing this paper: 0 · Datasets citing this paper: 0 · Spaces citing this paper: 0 · Collections including this paper: 3
