AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation
Authors: Kien T. Pham, I Chieh Chen, Qifeng Chen, Long Chen
Published: 2026-06-29
Project page: https://hkust-longgroup.github.io/AVTok
Code: https://github.com/hkust-longgroup/AVTok
Abstract
AVTok is a unified tokenizer for audio-video generation that uses a dual-stream transformer architecture with a shared encoder-decoder and modality-specific queries to create compact one-dimensional latent representations.
Audio-video generation has recently attracted considerable research attention. It aims to synthesize high-quality sounding video with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, which neglects the representation gap and demands intensive computational resources for training. Inspired by recent advances in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designed for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with a shared encoder-decoder and modality-specific learnable queries that encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy that progressively builds reconstruction capability for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok takes a step toward joint audio-video tokenization and offers a potential direction for building unified large multimodal models for audio-video generation.
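The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: modality-specific query sets read out a fixed number of latent vectors per modality, and one shared codebook turns them into a single 1D token sequence. Everything concrete here is an assumption for illustration, not the authors' code: the single-head cross-attention stands in for the shared transformer encoder, the queries and codebook are random instead of learned, and the names `cross_attend`, `quantize`, and `tokenize_pair` and all sizes are hypothetical.

```python
import math
import random


def cross_attend(queries, features):
    """Single-head cross-attention: each query returns a softmax-weighted mean of the features."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) / total for d in range(dim)]
        )
    return outputs


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (squared L2)."""
    ids = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def tokenize_pair(video_feats, audio_feats, video_queries, audio_queries, codebook):
    """Encode one audio-video pair into a single 1D sequence of codebook indices."""
    # The same attention routine (a stand-in for a shared encoder) serves both
    # modalities; only the query sets differ per modality.
    video_latents = cross_attend(video_queries, video_feats)
    audio_latents = cross_attend(audio_queries, audio_feats)
    # One codebook is shared by the video and audio latents.
    return quantize(video_latents + audio_latents, codebook)


def rand_matrix(rng, rows, dim):
    """Random stand-in for learned parameters or extracted patch features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8  # hypothetical feature width
video_feats = rand_matrix(rng, 64, DIM)   # hypothetical video patch features
audio_feats = rand_matrix(rng, 32, DIM)   # hypothetical audio patch features
video_queries = rand_matrix(rng, 6, DIM)  # hypothetical: 6 video latent slots
audio_queries = rand_matrix(rng, 2, DIM)  # hypothetical: 2 audio latent slots
codebook = rand_matrix(rng, 16, DIM)      # hypothetical: 16 shared code vectors

tokens = tokenize_pair(video_feats, audio_feats, video_queries, audio_queries, codebook)
print(len(tokens), tokens)
```

The point of the sketch is that the token count depends only on the number of queries, not on how many input patches each modality contributes, which is what makes the latent sequence compact and one-dimensional.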
Community
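This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). The following papers, recommended by the Semantic Scholar API, are similar to this paper:
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis (2026), https://huggingface.co/papers/2606.09098
Native Audio-Visual Alignment for Generation (2026), https://huggingface.co/papers/2605.30073
MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning (2026), https://huggingface.co/papers/2606.25225
WavFlow: Audio Generation in Waveform Space (2026), https://huggingface.co/papers/2605.18749
Vision Foundation Models as Generalist Tokenizers for Image Generation (2026), https://huggingface.co/papers/2605.18390
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement (2026), https://huggingface.co/papers/2606.02739
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation (2026), https://huggingface.co/papers/2605.17488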
arXiv: arxiv.org/abs/2606.30811