Hugging Face Daily Papers · 5 min read

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.30811

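Published on Jun 29, 2026 · Submitted by PHAM Trung Kien on Jul 1, 2026
Authors: Kien T. Pham, I Chieh Chen, Qifeng Chen, Long Chen
Project page: https://hkust-longgroup.github.io/AVTok
Code: https://github.com/hkust-longgroup/AVTok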

Abstract

AVTok is a unified tokenizer for audio-video generation that uses a dual-stream transformer architecture with shared encoder-decoder and modal-specific queries to create compact one-dimensional latent representations.

Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap between modalities while requiring intensive computational resources for proper training. Inspired by recent advances in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designed for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with a shared encoder-decoder and modal-specific learnable queries to efficiently and effectively encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy that progressively builds reconstruction capability for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for tackling the challenge of joint audio-video tokenization and provides a potential direction for building unified large multimodal models for audio-video generation.
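
To make the "unified codebook" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the codebook size, latent dimension, number of queries, or quantization rule, so every name and number below (`nearest_code`, `tokenize_pair`, `detokenize`, `DIM`, `CODEBOOK_SIZE`, `NUM_VIDEO_QUERIES`, `NUM_AUDIO_QUERIES`) is an illustrative assumption. The random vectors stand in for the encoder outputs at the modal-specific query positions, and nearest-neighbour lookup is a generic vector-quantization step that may differ from what AVTok actually uses.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize_pair(video_latents, audio_latents, codebook):
    """Quantize both modalities against ONE shared codebook and
    concatenate the ids into a single 1D token sequence."""
    video_ids = [nearest_code(v, codebook) for v in video_latents]
    audio_ids = [nearest_code(a, codebook) for a in audio_latents]
    return video_ids + audio_ids


def detokenize(token_ids, codebook, num_video_tokens):
    """Look the ids back up and split them into video and audio latents."""
    vectors = [codebook[i] for i in token_ids]
    return vectors[:num_video_tokens], vectors[num_video_tokens:]


# Hypothetical sizes, chosen only to keep the example small.
DIM, CODEBOOK_SIZE = 4, 16
NUM_VIDEO_QUERIES, NUM_AUDIO_QUERIES = 6, 2

rng = random.Random(0)


def random_vec():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


codebook = [random_vec() for _ in range(CODEBOOK_SIZE)]
# Stand-ins for encoder outputs at the learnable query positions.
video_latents = [random_vec() for _ in range(NUM_VIDEO_QUERIES)]
audio_latents = [random_vec() for _ in range(NUM_AUDIO_QUERIES)]

tokens = tokenize_pair(video_latents, audio_latents, codebook)
video_q, audio_q = detokenize(tokens, codebook, NUM_VIDEO_QUERIES)
print(tokens)
```

The point of the sketch is only that both modalities index into the same table, so a downstream generator sees one vocabulary and one flat sequence rather than two separate token spaces. The uneven query counts are a guess at how a tokenizer might allot more tokens to the heavier modality; the abstract does not state the actual split.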

Community

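Librarian Bot (automated comment): the Semantic Scholar API recommended the following papers as similar to this one.

- HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis (2026), https://huggingface.co/papers/2606.09098
- Native Audio-Visual Alignment for Generation (2026), https://huggingface.co/papers/2605.30073
- MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning (2026), https://huggingface.co/papers/2606.25225
- WavFlow: Audio Generation in Waveform Space (2026), https://huggingface.co/papers/2605.18749
- Vision Foundation Models as Generalist Tokenizers for Image Generation (2026), https://huggingface.co/papers/2605.18390
- EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement (2026), https://huggingface.co/papers/2606.02739
- Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation (2026), https://huggingface.co/papers/2605.17488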


Get this paper in your agent:

hf papers read 2606.30811
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
