Hugging Face Daily Papers · July 1, 2026 · 6 min read

DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.31537

Published on Jun 30 · Submitted by caoshuo on Jul 1

Authors: Siyu Yan, Yizhen Gao, Yilin Wang, Dongxing Mao, Alex Jinpeng Wang

Abstract

DataEvolver is a self-evolving multi-agent framework that improves text-rich image generation by leveraging feedback from rejected samples to iteratively enhance data quality.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Text-rich image generation is one of the most challenging settings in image generation, since models must simultaneously produce visually realistic images and render legible, semantically aligned, and layout-consistent text. Existing data pipelines usually follow a static crawl-filter-freeze paradigm. They collect candidate samples, filter them once, and freeze the accepted data for training. However, rejected samples are usually discarded, although they often contain useful failure signals such as OCR errors and semantic mismatches. As a result, later construction rounds may repeat the same failure modes. To address these limitations, we propose DataEvolver, a self-evolving multi-agent framework for text-rich image data construction. DataEvolver treats data construction as feedback-driven construction policy evolution. A Retriever collects candidate samples, a Verifier assigns quality scores and rejection causes, a Critic summarizes round-level feedback into semantic feedback, and a Generator completes under-covered regions through targeted synthesis. The updated feedback memory then guides the next construction round. Experiments on text-rich image generation benchmarks show that DataEvolver produces more useful training data than fixed-dataset baselines under matched data budgets. At the 0.75M scale on PixArt-alpha, DataEvolver improves OCR-F1 over the strongest baseline by 85.3 percent on TextScenesHQ and 35.3 percent on LongTextBench. The improvements are consistent across both evaluated benchmarks and also transfer to Show-o2, indicating that the benefit of DataEvolver is not tied to a single downstream generator. These results suggest that rejected samples can provide actionable feedback for improving text-rich image data construction.
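The abstract names four roles and a feedback memory but gives no implementation details, so the following is only a minimal Python sketch of how such a loop could be wired together. Every name here (`Sample`, `FeedbackMemory`, `retrieve`, `verify`, `criticize`, `generate`, `run_rounds`), the string-similarity scoring, the 0.9 threshold, and the topic-based notion of coverage are illustrative assumptions, not the authors' method.

```python
from collections import Counter
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Sample:
    topic: str      # coarse coverage region (assumed), e.g. "poster"
    text: str       # text the image is supposed to contain
    ocr_text: str   # text read back by a hypothetical OCR pass


@dataclass
class FeedbackMemory:
    causes: Counter = field(default_factory=Counter)  # (topic, cause) -> count
    notes: list = field(default_factory=list)         # round-level summaries


def retrieve(pool, memory, budget):
    """Retriever: prefer topics with fewer recorded rejections (assumed policy)."""
    def past_rejections(s):
        return sum(n for (topic, _), n in memory.causes.items() if topic == s.topic)
    return sorted(pool, key=past_rejections)[:budget]  # stable sort on ties


def verify(sample, threshold=0.9):
    """Verifier: return (score, rejection_cause); cause is None when accepted."""
    if not sample.ocr_text:
        return 0.0, "no_text_detected"
    score = SequenceMatcher(None, sample.text, sample.ocr_text).ratio()
    return score, (None if score >= threshold else "ocr_mismatch")


def criticize(rejected, memory, round_idx):
    """Critic: fold this round's rejection causes into the feedback memory."""
    round_causes = Counter((s.topic, cause) for s, cause in rejected)
    memory.causes.update(round_causes)
    if round_causes:
        (topic, cause), n = round_causes.most_common(1)[0]
        memory.notes.append(f"round {round_idx}: {n}x {cause} in '{topic}'")


def generate(accepted, topics, min_per_topic):
    """Generator: placeholder synthesis for topics below the coverage floor."""
    coverage = Counter(s.topic for s in accepted)
    new = []
    for topic in topics:
        for i in range(max(0, min_per_topic - coverage[topic])):
            text = f"{topic} synthetic {i}"
            new.append(Sample(topic, text, text))  # stand-in for a clean render
    return new


def run_rounds(pool, topics, rounds=2, budget=3, min_per_topic=2):
    """Run retrieve -> verify -> criticize -> generate for a fixed number of rounds."""
    pool, accepted, memory = list(pool), [], FeedbackMemory()
    for r in range(rounds):
        batch = retrieve(pool, memory, budget)
        taken = {id(s) for s in batch}
        pool = [s for s in pool if id(s) not in taken]
        rejected = []
        for s in batch:
            _, cause = verify(s)
            if cause is None:
                accepted.append(s)
            else:
                rejected.append((s, cause))
        criticize(rejected, memory, r)  # memory now shapes the next retrieve()
        accepted.extend(generate(accepted, topics, min_per_topic))
    return accepted, memory
```

In this sketch the rejected samples are never reused as training data. They only change what the next round retrieves, which is the sense in which the abstract describes rejection causes as feedback rather than waste.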