Hugging Face Daily Papers · 6 min read

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Existing spoken language models (SLMs) typically use a fixed speech-token frame rate (for example, 25 Hz or 12.5 Hz). This fixed-rate design cannot adapt to time-varying speech complexity and does not offer a direct speed-quality trade-off at inference time. We introduce FlexiSLM, the first SLM that supports dynamic and controllable frame rates on both speech input and output. A single trained model can be steered from 12.5 Hz down to 4.0 Hz without retraining. Open-source release is coming soon!
arxiv:2606.31247

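Published on Jun 30 · Submitted by Jiaqi Li on Jul 1
Authors: Jiaqi Li, Chaoren Wang, Xiaohai Tian, Mingjie Chen, Xinyu Liang, Xu Li, Yufan Lin, Junwen Qiu, Jun Zhang, Lu Lu, Haizhou Li, Zhizheng Wu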

Abstract

Flexible Spoken Language Model (FlexiSLM) introduces dynamic frame rate capabilities for speech input and output, achieving superior performance over fixed-frame-rate models while enabling controllable inference speed.

Spoken language models (SLMs) extend LLMs to speech input and output. Existing SLMs represent speech at fixed frame rates (e.g., 25 or 12.5 Hz), ignoring the time-varying information density of speech and offering no flexibility to trade off quality for speed at inference time. Recent audio tokenizer research has proposed dynamic frame rate speech coding, which exploits this non-uniformity and enables two new capabilities: very low average frame rates and frame rate controllability. However, this technique has not yet been applied to SLMs. We introduce Flexible Spoken Language Model (FlexiSLM), the first SLM that supports dynamic and controllable frame rates on both speech input and output. Using dynamic frame rate representations, FlexiSLM outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at its high-quality operating points. We further verify that FlexiSLM can be accurately steered down to 4.0 Hz; at 6.25 Hz, it roughly halves inference time relative to 12.5 Hz while retaining strong speech-to-speech quality. Audio samples are available at https://flexislm.github.io.
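
The speed claim in the abstract follows from simple token arithmetic, which a short sketch can make concrete. The Python below is illustrative only: the function names `speech_tokens` and `relative_sequence_length` and the 10-second clip are assumptions, not part of FlexiSLM or its release. It treats the frame rate as an average (a dynamic-rate model spends more tokens on dense regions and fewer on sparse ones) and assumes, as a simplification, that autoregressive decoding cost grows roughly linearly with the number of speech tokens.

```python
# Illustrative arithmetic only. The names and the 10-second clip are
# hypothetical and do not come from the FlexiSLM paper or code.


def speech_tokens(duration_s: float, frame_rate_hz: float) -> float:
    """Average number of speech tokens for a clip at a given average frame rate."""
    if duration_s < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    return duration_s * frame_rate_hz


def relative_sequence_length(rate_hz: float, baseline_hz: float = 12.5) -> float:
    """Speech-token count at rate_hz as a fraction of the count at baseline_hz."""
    if rate_hz <= 0 or baseline_hz <= 0:
        raise ValueError("frame rates must be > 0")
    return rate_hz / baseline_hz


if __name__ == "__main__":
    clip_s = 10.0  # assumed clip length, chosen only for illustration
    for rate in (25.0, 12.5, 6.25, 4.0):  # frame rates mentioned in the abstract
        tokens = speech_tokens(clip_s, rate)
        fraction = relative_sequence_length(rate)
        print(f"{rate:5.2f} Hz: {tokens:6.1f} tokens, {fraction:.2f}x the 12.5 Hz length")
```

Under these assumptions, 6.25 Hz yields half the speech tokens of 12.5 Hz, which is consistent with the abstract's report that inference time is roughly halved. Measured speedups depend on factors this sketch ignores, such as text tokens and fixed per-request overhead.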

Community

Paper submitter about 14 hours ago

Existing spoken language models (SLMs) typically use a fixed speech-token frame rate (for example, 25 Hz or 12.5 Hz). This fixed-rate design cannot adapt to time-varying speech complexity and does not offer a direct speed-quality trade-off at inference time. We introduce FlexiSLM, the first SLM that supports dynamic and controllable frame rates on both speech input and output. A single trained model can be steered from 12.5 Hz down to 4.0 Hz without retraining. Open-source release is coming soon!

Get this paper in your agent:

hf papers read 2606.31247
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
