Hugging Face Daily Papers · July 1, 2026 · 12 min read

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.32039


Published on Jun 30 · Submitted by linbin on Jul 1 · Tencent Hunyuan


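Authors:

Bin Lin, Zheyuan Liu, Chenguo Lin, Sixiang Chen, Yunyang Ge, Yunlong Lin, Jianwei Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Li Yuan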

Abstract

GEAR trains a vector-quantized tokenizer and autoregressive generator jointly end-to-end using representation alignment, overcoming non-differentiability issues through a dual read-out approach that improves convergence speed and feature quality.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.
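The dual read-out described in the abstract can be pictured with a small forward-pass sketch. This is an illustration only, not the authors' implementation: the abstract does not say how the soft branch is computed, so the softmax over negative squared distances, the temperature `tau`, and all function names below are assumptions.

```python
import math

def squared_distances(z, codebook):
    """Squared Euclidean distance from feature z to every codebook entry."""
    return [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]

def hard_readout(z, codebook):
    """Hard branch: index of the nearest code and its one-hot vector."""
    dists = squared_distances(z, codebook)
    index = min(range(len(dists)), key=dists.__getitem__)
    one_hot = [1.0 if k == index else 0.0 for k in range(len(codebook))]
    return index, one_hot

def soft_readout(z, codebook, tau=1.0):
    """Soft branch (assumed form): softmax over negative scaled distances."""
    logits = [-d / tau for d in squared_distances(z, codebook)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-D codebook and encoder feature (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.2]

index, one_hot = hard_readout(z, codebook)    # discrete index for the AR model
weights = soft_readout(z, codebook, tau=0.5)  # smooth weights over the codebook
```

Both branches read the same codebook assignment. The hard one yields the discrete index used for next-token prediction, while the soft one yields weights that vary smoothly with `z`, which is what would let an alignment loss reach the tokenizer.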

View arXiv page View PDF Project page GitHub 47 Add to collection

Community

BinLin203

Paper author about 19 hours ago • edited about 19 hours ago

📄 Paper: https://arxiv.org/abs/2606.32039
💻 Code: https://github.com/Tencent-Hunyuan/GEAR
🤗 Models: https://huggingface.co/collections/BinLin203
🏠 Homepage: https://linb203.github.io/gear

[1/6] 🚨 Stop freezing your tokenizers! Visual Autoregressive (AR) models are stuck in the 2-stage era. We present GEAR ⚙️ (Guided End-to-End AutoRegression). By jointly training the VQ tokenizer & AR generator, we achieve up to 10x faster convergence than strong baselines like LlamaGen-REPA! 🧵👇

BinLin203

Paper author about 19 hours ago

[2/6] 🧠 The Bottleneck & The Solution Why is end-to-end discrete AR so hard? The non-differentiable argmax! Naive Straight-Through Estimator (STE) causes the codebook to collapse. 💥 GEAR solves this with a brilliant Dual Read-out mechanism: 1️⃣ Hard branch: Trains AR with next-token prediction. 2️⃣ Soft branch: A differentiable path carrying alignment loss back to guide only the tokenizer.
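To see why the argmax blocks gradients, a finite-difference check on a toy 1-D codebook may help. Everything here is hypothetical (the codebook `CODES`, the temperature, the softmax-weighted lookup); it only illustrates that a hard lookup is piecewise constant while a soft one is not.

```python
import math

CODES = [-1.0, 0.0, 1.0]  # toy 1-D codebook (hypothetical)

def hard_value(z):
    """Code returned by a hard nearest-neighbour lookup."""
    return min(CODES, key=lambda c: (z - c) ** 2)

def soft_value(z, tau=0.5):
    """Assumed soft lookup: softmax-weighted average of the codes."""
    logits = [-(z - c) ** 2 / tau for c in CODES]
    peak = max(logits)
    w = [math.exp(l - peak) for l in logits]
    return sum(wi * c for wi, c in zip(w, CODES)) / sum(w)

def finite_diff(f, z, eps=1e-4):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = 0.3
hard_grad = finite_diff(hard_value, z)  # 0.0: the selected code does not move
soft_grad = finite_diff(soft_value, z)  # positive: the weights shift with z
```

Away from a decision boundary the hard output does not react to small changes in `z`, so no learning signal passes through it. The soft output does react, which is the property the soft branch relies on.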

BinLin203

Paper author about 19 hours ago

[3/6] 🤯 The Mind-Blowing Finding (Representation Shift) Here is the most surprising part: Unlike diffusion models (where end-to-end training makes latents more semantic), GEAR does the exact OPPOSITE! The tokenizer becomes less DINOv2-like, reorganizing for pure predictability (lower entropy). The semantic alignment burden shifts entirely to the AR model's hidden states!
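The post says the tokenizer reorganizes toward "lower entropy" without specifying which entropy is meant. One simple reading, assumed here, is the Shannon entropy of the token-index histogram. A conditional entropy given previous tokens would be closer to what an AR model actually predicts, so treat this as a rough proxy with made-up index sequences.

```python
import math
from collections import Counter

def index_entropy_bits(indices):
    """Shannon entropy (in bits) of the empirical index distribution."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical index sequences, not data from the paper.
spread_out = [0, 1, 2, 3, 0, 1, 2, 3]  # all four codes used equally
peaked = [0, 0, 0, 0, 0, 0, 1, 2]      # one code dominates

h_spread = index_entropy_bits(spread_out)
h_peaked = index_entropy_bits(peaked)
```

A more peaked index distribution has lower entropy and is, in this narrow sense, easier to predict.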

BinLin203

Paper author about 19 hours ago

[4/6] 🚀 The Results: Faster & Better The numbers speak for themselves. On ImageNet 256x256, GEAR consistently beats baselines across B/L/XL scales (gFID drops to 2.52). On Text-to-Image (GPIC), it reaches the baseline's final REPA alignment loss 11.1x faster and NTP loss 2.5x faster! ⚡️
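A figure such as "reaches the baseline's final loss N× faster" is usually a ratio of training steps. The sketch below shows that arithmetic under an assumed definition (baseline's last logged step divided by the first step at which the other run matches the baseline's final loss). The curves are synthetic, and the paper's exact protocol may differ.

```python
def first_step_reaching(curve, target):
    """First logged step whose loss is at or below target, else None."""
    for step, loss in curve:
        if loss <= target:
            return step
    return None

def convergence_speedup(baseline_curve, method_curve):
    """Baseline's final step divided by the step at which the method
    first matches the baseline's final loss (assumed definition)."""
    final_step, final_loss = baseline_curve[-1]
    reached = first_step_reaching(method_curve, final_loss)
    return None if reached is None else final_step / reached

# Synthetic (step, loss) logs, not results from the paper.
baseline = [(100, 5.0), (200, 4.0), (400, 3.0), (800, 2.5)]
method = [(100, 3.5), (200, 2.4), (400, 2.0), (800, 1.8)]

speedup = convergence_speedup(baseline, method)
```

Because the ratio depends on logging granularity and on where the baseline run stops, it is worth checking those details in the paper before comparing speedup numbers across works.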

BinLin203

Paper author about 19 hours ago

[5/6] 🔧 Generality Across Quantizers GEAR isn't just a one-trick pony for VQ-VAE. The soft-guidance mechanism is highly general! It works seamlessly across different quantizers: VQVAE, LFQ, and IBQ. In every single case, GEAR improves BOTH generation quality and reconstruction fidelity. 📈
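For readers less familiar with the quantizers named here, lookup-free quantization (LFQ) replaces the nearest-neighbour search with a per-channel sign test and reads the token index off the resulting bits. The sketch below shows only that hard quantization step. How GEAR attaches its soft branch to LFQ is not described in this thread and is not attempted here, and the function name is illustrative.

```python
def lfq_quantize(z):
    """Quantize each channel to -1 or +1 by sign and pack the bits
    into an integer token index (channel i contributes 2**i)."""
    codes = [1.0 if v > 0 else -1.0 for v in z]
    index = sum(1 << i for i, v in enumerate(z) if v > 0)
    return codes, index

# A 4-channel latent gives an implicit codebook of 2**4 = 16 entries.
codes, index = lfq_quantize([0.7, -0.2, 0.1, -1.5])
```

Channels 0 and 2 are positive in this example, so the index is 1 + 4 = 5.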

BinLin203

Paper author about 19 hours ago

[6/6] 🔮 The Future of Visual AR Despite the reconstruction ceiling of discrete tokens, end-to-end VQ-AR is the key to unified, long-context generation. GEAR paves the way for applying LLM-style alignment (RLHF, DPO) directly to visual tokens! Dive into the paper and code below: 👇

📄 Paper: https://arxiv.org/abs/2606.32039
💻 Code: https://github.com/Tencent-Hunyuan/GEAR
🤗 Models: https://huggingface.co/collections/BinLin203

LanguageBind

Paper submitter about 18 hours ago

Visual generative models typically suffer from decoupled two-stage training. GEAR solves this by enabling end-to-end joint training of a tokenizer and an AR generator. It overcomes the non-differentiability of discrete tokens via a dual read-out mechanism: a hard branch for AR prediction and a soft branch to pass gradients back to the tokenizer. This shifts the semantic alignment burden to the AR model, guiding the tokenizer to produce easily predictable indices. Consequently, GEAR accelerates convergence by up to 10x, improves spatial feature coherence, and demonstrates strong generalization across different quantizers (VQVAE, LFQ, IBQ) and text-to-image (T2I) generation.
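The "semantic alignment" mentioned above is typically scored, in REPA-style training, as a patch-wise cosine similarity between a model's features and those of a pretrained encoder such as DINOv2. The sketch below assumes that form. The feature vectors are toy stand-ins, and any projection head the real method uses is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(features, targets):
    """Negative mean patch-wise cosine similarity (assumed loss form)."""
    sims = [cosine(f, t) for f, t in zip(features, targets)]
    return -sum(sims) / len(sims)

# Toy per-patch features; `targets` stands in for DINOv2 patch features.
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]     # same directions as the targets
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to the targets

loss_aligned = alignment_loss(aligned, targets)
loss_misaligned = alignment_loss(misaligned, targets)
```

A lower loss means features that are more "DINOv2-like" in this sense. The thread's claim is that under GEAR this score improves for the AR model's hidden states while the tokenizer's own features move the other way.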