Hugging Face Daily Papers · June 30, 2026 · 3 min read

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Accepted by ECCV2026

arxiv:2606.26016

Published on Jun 24

Submitted by Chen Yang on Jun 30

Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Xinwen Zhang, Qiushi Guo, Tiezheng Ge, Limin Wang

Abstract

MIMFlow combines Normalizing Flows with Masked Image Modeling to improve generative modeling by decoupling semantic representation from pixel-level details, achieving better performance with fewer tokens.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latents from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
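
To make "exact density estimation" and "strict invertibility" concrete, here is a minimal sketch of a single affine coupling layer, a common building block of normalizing flows. It is not taken from the MIMFlow code, and the paper may use a different flow design. The function `toy_scale_shift` is a hypothetical stand-in for a learned network, and the 4-dimensional vector `x` stands in for a latent.

```python
import math
import random

def toy_scale_shift(x1):
    # Hypothetical stand-in for a learned network that predicts a
    # log-scale and a shift for the second half from the first half.
    log_s = [math.tanh(a) for a in x1]
    t = [0.5 * a for a in x1]
    return log_s, t

def coupling_forward(x, scale_shift):
    """Map x -> z. The first half passes through unchanged; the second
    half is scaled and shifted. Assumes len(x) is even.
    Returns (z, log|det dz/dx|)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = scale_shift(x1)
    z2 = [b * math.exp(s) + m for b, s, m in zip(x2, log_s, t)]
    # The Jacobian is triangular, so its log-determinant is sum(log_s).
    return x1 + z2, sum(log_s)

def coupling_inverse(z, scale_shift):
    """Exact inverse of coupling_forward (used for sampling)."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    log_s, t = scale_shift(z1)
    x2 = [(b - m) * math.exp(-s) for b, s, m in zip(z2, log_s, t)]
    return z1 + x2

def log_likelihood(x, scale_shift):
    """Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|."""
    z, log_det = coupling_forward(x, scale_shift)
    log_pz = sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in z)
    return log_pz + log_det

random.seed(0)
x = [random.gauss(0, 1) for _ in range(4)]
z, log_det = coupling_forward(x, toy_scale_shift)
x_rec = coupling_inverse(z, toy_scale_shift)
```

Because the inverse recovers `x` from `z` exactly (up to floating-point error), every input dimension must be preserved by the flow. This is the capacity pressure the abstract describes, and it is why applying the flow to a compact semantic latent, rather than to raw pixels, can help.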

Community

Y-Sisyphus

Paper author Paper submitter about 23 hours ago

Accepted by ECCV2026

Get this paper in your agent:

hf papers read 2606.26016

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
