Hugging Face Daily Papers · 6 min read

Interleaved Speech Language Models Latently Work In Text

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Talia Sternberg, Gallil Maimon, Yossi Adi
Organization: The Hebrew University of Jerusalem
Project page: https://pages.cs.huji.ac.il/adiyoss-lab/slm_work_in_text/
arxiv:2606.22473

Published on Jun 21, 2026 · Submitted by Talia Sternberg on Jun 30, 2026

Abstract

Interleaved speech-text language models exhibit an implicit transcription phase where text tokens become decodable in intermediate layers, followed by text-based prediction before speech domain transformation.

Speech language models (SLMs) have been studied extensively, and the common paradigm incorporates text data and pre-trained text LMs. A leading approach is speech-text interleaving, in which models are trained on sequences containing both speech and text tokens, with the aim of improving even speech-only capabilities. Yet how the two modalities interact in the model's latent space remains unclear. In this work, we analyze interleaved speech-text LMs from different model families and sizes through the logit lens to provide such insight. We reveal that these models go through an implicit transcription phase in which the text token of the spoken word becomes decodable in intermediate layers, despite the models not being trained for speech recognition. The transcription of the word appears among the top candidate words for up to 77% of the data. Following this stage, the models predict the next word in text space before transforming back to the speech domain. Finally, we analyze the role of interleaved data and of initialization from text LMs in eliciting this behavior, and examine how it correlates with spoken knowledge abilities. Our analysis sheds light on the internal mechanisms underlying the relationship between the speech and text modalities and could shape SLM optimization.
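
The logit lens, as it is commonly described, decodes an intermediate hidden state by passing it through the model's final normalization and output (unembedding) matrix, then inspecting the top-ranked tokens. The block below is a minimal sketch of that idea on toy data, not the authors' code. The function names (`logit_lens_topk`, `first_decodable_layer`, `transcription_rate`), the gain-free RMS normalization, the toy matrices, and the rule "a hit counts if the target is in the top-k at any layer" are all assumptions made for illustration; the abstract does not specify the value of k or how layers are aggregated.

```python
import math


def rms_norm(hidden, eps=1e-6):
    """Scale a hidden vector to unit root-mean-square (no learned gain; an assumption)."""
    rms = math.sqrt(sum(x * x for x in hidden) / len(hidden) + eps)
    return [x / rms for x in hidden]


def logit_lens_topk(hidden, unembed, k=5):
    """Project one hidden state through the output head and return the top-k token ids."""
    h = rms_norm(hidden)
    logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def first_decodable_layer(layer_states, unembed, target_id, k=5):
    """Return the first layer index whose top-k contains target_id, or None."""
    for layer, hidden in enumerate(layer_states):
        if target_id in logit_lens_topk(hidden, unembed, k):
            return layer
    return None


def transcription_rate(examples, unembed, k=5):
    """Fraction of examples whose target text token is in the top-k at some layer."""
    hits = sum(
        first_decodable_layer(states, unembed, target, k) is not None
        for states, target in examples
    )
    return hits / len(examples)


# Toy data: 4-token vocabulary, 3-dimensional hidden states (all values are made up).
UNEMBED = [
    [1.0, 0.0, 0.0],     # token 0
    [0.0, 1.0, 0.0],     # token 1
    [0.0, 0.0, 1.0],     # token 2
    [-1.0, -1.0, -1.0],  # token 3
]
EXAMPLES = [
    # (hidden states of one position at layers 0..2, text token id of the spoken word)
    ([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.1, 0.2, 0.9]], 1),
    ([[0.9, 0.1, 0.0], [0.8, 0.1, 0.2], [0.7, 0.0, 0.3]], 3),
]

if __name__ == "__main__":
    for states, target in EXAMPLES:
        print(target, first_decodable_layer(states, UNEMBED, target, k=1))
    print(transcription_rate(EXAMPLES, UNEMBED, k=1))
```

In a real model the hidden states would come from a forward pass over speech tokens and the unembedding matrix from the model's output head; the per-layer index returned by `first_decodable_layer` is one hypothetical way to locate where an implicit transcription becomes visible.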

Get this paper in your agent:

hf papers read 2606.22473
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
