Hugging Face Daily Papers · 4 min read

Trimming the Long-Tail of Visual World Modeling Evaluation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.24256

Published on Jun 23 · Submitted by bingxuan li on Jun 30
Authors: Bingxuan Li, Yining Hong, Cheng Qian, Hyeonjeong Ha, Jiateng Liu, Zhenhailong Wang, Yue Guo, Yunzhu Li, Heng Ji

Abstract

Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks.

Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.
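
The evaluation grid described in the abstract (three scenario modes crossed with two generation settings) can be sketched as a small data structure plus an aggregation step. This is a minimal illustration only, not the Tailor-Bench code or API: the names `Mode`, `Setting`, `Result`, `mean_by_cell`, and `long_tail_gap` are hypothetical, and the per-sample `score` in [0, 1] is an assumed metric, since the abstract does not specify how outputs are scored.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Mode(Enum):
    """Scenario modes named in the abstract."""
    REGULAR = "regular"                # common tool-task pairs
    UNCONVENTIONAL = "unconventional"  # attribute-compatible substitute tools
    IMPOSSIBLE = "impossible"          # attribute-violating tools


class Setting(Enum):
    """Generation settings named in the abstract."""
    PREDICTIVE = "predictive"    # outcome must be inferred by the model
    DESCRIPTIVE = "descriptive"  # target outcome is given in the prompt


@dataclass(frozen=True)
class Result:
    """One evaluated sample. `score` is an assumed metric in [0, 1]."""
    mode: Mode
    setting: Setting
    score: float


def mean_by_cell(results):
    """Average the scores within each (mode, setting) cell of the grid."""
    cells = defaultdict(list)
    for r in results:
        cells[(r.mode, r.setting)].append(r.score)
    return {cell: mean(scores) for cell, scores in cells.items()}


def long_tail_gap(results, setting):
    """Drop in mean score relative to Regular scenarios, for one setting.

    A positive value means the model does worse on that mode than on
    Regular scenarios, which is the degradation the abstract reports.
    """
    means = mean_by_cell(results)
    baseline = means[(Mode.REGULAR, setting)]
    return {
        mode: baseline - means[(mode, setting)]
        for mode in (Mode.UNCONVENTIONAL, Mode.IMPOSSIBLE)
    }
```

Calling `long_tail_gap(results, Setting.PREDICTIVE)` on a list of `Result` objects returns one gap per non-Regular mode. Keeping the two settings separate reflects the abstract's framing of them as complementary rather than interchangeable.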

Get this paper in your agent:

hf papers read 2606.24256
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
