Hugging Face Daily Papers · June 9, 2026 · 4 min read

Liberating LLM Capabilities in Full-Duplex Speech Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


LWS is a simple “free lunch” for full-duplex speech models: without changing the model architecture, we add a visible writing channel through a token schema, allowing the model to speak in real time while also producing text-native outputs such as code, tables, derivations, and structured reasoning.

GitHub: https://github.com/zly-idleness/lws_demo

arxiv:2606.07547

Published on May 4, 2026 · Submitted by zly on Jun 9, 2026

Authors: Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

Abstract

A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating strong performance in full-duplex conversational tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Speech-based large language models are typically constrained to spoken replies. This limits their user-facing outputs to what can be verbalized and, for tasks that require persistent, structured, and inspectable intermediate outputs, suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in real-time interaction. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a real-time oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing real-time responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
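To make the tri-channel idea concrete, here is a minimal sketch of how listening, writing, and speaking could share one autoregressive token stream. It is an illustration only: the marker tokens (`<listen>`, `<write>`, `<speak>`, `<end>`), the `Chunk` type, and the `interleave` and `demux` helpers are all hypothetical, since the abstract does not specify the paper's actual Token Schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical channel markers; the paper's real Token Schema is not given here.
LISTEN, WRITE, SPEAK, END = "<listen>", "<write>", "<speak>", "<end>"


@dataclass
class Chunk:
    """One time step (for example, one second) of the interaction."""
    heard: list[str]                                  # user audio tokens (input)
    written: list[str] = field(default_factory=list)  # visible text (output)
    spoken: list[str] = field(default_factory=list)   # speech tokens (output)


def interleave(chunks: list[Chunk]) -> list[str]:
    """Flatten per-chunk channels into one causal token sequence."""
    seq: list[str] = []
    for c in chunks:
        for marker, tokens in ((LISTEN, c.heard), (WRITE, c.written), (SPEAK, c.spoken)):
            seq += [marker, *tokens, END]
    return seq


def demux(seq: list[str]) -> dict[str, list[str]]:
    """Route a flat token stream back to its three channels."""
    channels: dict[str, list[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in channels:        # a marker opens a channel span
            current = tok
        elif tok == END:           # the end marker closes the open span
            current = None
        elif current is not None:  # payload token inside an open span
            channels[current].append(tok)
    return channels


# Placeholder tokens: "a*" stands for audio input, "s*" for speech output.
chunks = [
    Chunk(heard=["a0", "a1"], written=["def"], spoken=["s0"]),
    Chunk(heard=["a2"], written=["add(a,", "b):"], spoken=["s1", "s2"]),
]
stream = interleave(chunks)
routed = demux(stream)
print(routed[WRITE])
```

Because every channel lives in the same sequence, a standard causal attention mask would already let each written or spoken token attend to all audio revealed so far, which is one way to read the claim that no architectural change is required.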


Community

zly-idleness

Paper author · Paper submitter · about 7 hours ago

Project page: https://royalzhang.com/project/lws-page/


Get this paper in your agent:

hf papers read 2606.07547

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

