Hugging Face Daily Papers · July 1, 2026 · 5 min read

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Kien T. Pham, I Chieh Chen, Qifeng Chen, Long Chen

Project page: https://hkust-longgroup.github.io/AVTok

GitHub: https://github.com/hkust-longgroup/AVTok

arxiv:2606.30811

Published on Jun 29

Submitted by PHAM Trung Kien on Jul 1

Abstract

AVTok is a unified tokenizer for audio-video generation that uses a dual-stream transformer architecture with shared encoder-decoder and modal-specific queries to create compact one-dimensional latent representations.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. The preceding methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap while necessitating intensive computational resources for proper training. Inspired by recent advancements in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designated for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with shared encoder-decoder and modal-specific learnable queries to efficiently and effectively encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy to progressively realize reconstruction capabilities for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for the challenge of joint audio-video tokenization and provides a potential direction to build unified large multimodal models for audio-video generation.
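
To make the "compact one-dimensional latent representation with a unified codebook" concrete, here is a minimal sketch of the quantization step only. It is not the authors' implementation. It assumes plain nearest-neighbor vector quantization, random vectors standing in for the shared encoder's outputs at the modal-specific query positions, and a video-then-audio token order. The names `nearest_code`, `quantize`, and `tokenize_pair`, and all sizes, are hypothetical.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents, codebook):
    """Map each latent vector to the id of its nearest codebook entry."""
    return [nearest_code(vector, codebook) for vector in latents]


def tokenize_pair(video_latents, audio_latents, codebook):
    """Build one 1D token sequence for an audio-video pair.

    Both modalities are quantized against the SAME codebook. The
    video-first ordering is an assumption made for this sketch.
    """
    return quantize(video_latents, codebook) + quantize(audio_latents, codebook)


# Hypothetical sizes, chosen only to keep the example small.
DIM, CODEBOOK_SIZE = 4, 16
NUM_VIDEO_QUERIES, NUM_AUDIO_QUERIES = 6, 2

rng = random.Random(0)


def random_vectors(count):
    """Stand-in for encoder outputs: `count` random vectors of size DIM."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


codebook = random_vectors(CODEBOOK_SIZE)
video_latents = random_vectors(NUM_VIDEO_QUERIES)
audio_latents = random_vectors(NUM_AUDIO_QUERIES)

tokens = tokenize_pair(video_latents, audio_latents, codebook)
print(len(tokens), tokens)
```

Because every id indexes the same codebook, a downstream generator can model the pair as a single flat sequence over one vocabulary rather than two separate token streams.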

Community

TK3105

Paper submitter about 24 hours ago

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

librarian-bot

14 minutes ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

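The following papers were recommended by the Semantic Scholar API:

- HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis (2026), https://huggingface.co/papers/2606.09098
- Native Audio-Visual Alignment for Generation (2026), https://huggingface.co/papers/2605.30073
- MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning (2026), https://huggingface.co/papers/2606.25225
- WavFlow: Audio Generation in Waveform Space (2026), https://huggingface.co/papers/2605.18749
- Vision Foundation Models as Generalist Tokenizers for Image Generation (2026), https://huggingface.co/papers/2605.18390
- EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement (2026), https://huggingface.co/papers/2606.02739
- Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation (2026), https://huggingface.co/papers/2605.17488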

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Get this paper in your agent:

hf papers read 2606.30811

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
