Tag

Voice

389 articles archived under #voice · RSS

r/LocalLLaMA community 13d ago

My suitcase robot gets high now off a real gas sensor wired straight into the LLM sampler. Smoke raises temperature/top_p/top_k live, so his speech genuinely gets loopier and never repeats.

Follow-up on Sparky, my offline suitcase robot I keep overdeveloping. He gets high now, and there's no scripted "stoned mode" anywhere in it. A real MQ-2 gas sensor sits in the case. Every 0.5s I read it against an adaptive clean-air baseline and turn a smoke hit into a 0 to 10…

30
r/MachineLearning community 13d ago

Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D]

I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with…

25
arXiv — NLP / Computation & Language research 14d ago

Fair Cognitive Impairment Detection Through Unlearning

arXiv:2606.18571v1 Announce Type: cross Abstract: Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned…

33
arXiv — NLP / Computation & Language research 14d ago

Continuous Audio Thinking for Large Audio Language Models

arXiv:2606.18273v1 Announce Type: new Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned…

37
arXiv — NLP / Computation & Language research 14d ago

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded…

5
arXiv — NLP / Computation & Language research 14d ago

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

arXiv:2606.18584v1 Announce Type: new Abstract: Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of…

12
arXiv — NLP / Computation & Language research 14d ago

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

arXiv:2606.18852v1 Announce Type: new Abstract: Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface…

35
arXiv — NLP / Computation & Language research 14d ago

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors…

8
arXiv — NLP / Computation & Language research 14d ago

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

arXiv:2606.18979v1 Announce Type: cross Abstract: Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but…

36
arXiv — NLP / Computation & Language research 14d ago

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge…

35
arXiv — NLP / Computation & Language research 14d ago

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

arXiv:2506.12311v3 Announce Type: replace Abstract: Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. We present a framework for more phonetically…

38
arXiv — NLP / Computation & Language research 14d ago

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

arXiv:2508.07375v3 Announce Type: replace Abstract: Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and…

20
arXiv — NLP / Computation & Language research 14d ago

UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition

arXiv:2509.14653v2 Announce Type: replace Abstract: This paper proposes a unimodal aggregation (UMA) based nonautoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates acoustic frames (with unimodal weights that…

28
MIT News — AI research 14d ago

MIT in the media: For the future of tech, "Massachusetts can absolutely lead"

Leaders, faculty across MIT discuss fostering innovation and talent in Greater Boston in special series of articles published alongside the outlet's annual list of 'Tech Power Players'

27
r/LocalLLaMA community 14d ago

I released Inflect-Nano, an ultra-extreme tiny 4.63M-parameter TTS model.

I’ve been experimenting with how small a usable neural TTS model can realistically get, and I just released Inflect-Nano-v1. As far as I researched (though I could be wrong on this), Inflect-Nano-v1 is the #2 smallest TTS model publicly released (after TinyTTS), and it…

24
arXiv — NLP / Computation & Language research 15d ago

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

arXiv:2606.17255v1 Announce Type: new Abstract: This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create…

20
arXiv — NLP / Computation & Language research 15d ago

Are you speaking my languages? On spoken language adherence in multimodal LLMs

arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To…

9
arXiv — NLP / Computation & Language research 15d ago

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

arXiv:2606.17820v1 Announce Type: new Abstract: This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of…

28
arXiv — NLP / Computation & Language research 15d ago

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

arXiv:2606.17826v1 Announce Type: new Abstract: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics…

20
arXiv — NLP / Computation & Language research 15d ago

Perceptual compensation for tonal context in self-supervised speech models

arXiv:2606.17835v1 Announce Type: new Abstract: This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones,…

30
arXiv — NLP / Computation & Language research 15d ago

Learning task-specific subspaces via interventional post-training of speech foundation models

arXiv:2606.17967v1 Announce Type: new Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech…

5
arXiv — NLP / Computation & Language research 15d ago

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated…

15
arXiv — NLP / Computation & Language research 15d ago

Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition

arXiv:2606.17537v1 Announce Type: cross Abstract: Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is…

30
arXiv — NLP / Computation & Language research 15d ago

ALAS: An Automatic Latent Alignment Score for Audio Language Models

arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio-text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion…

17
arXiv — NLP / Computation & Language research 16d ago

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior…

7
arXiv — NLP / Computation & Language research 16d ago

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

arXiv:2606.15266v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains…

16
arXiv — NLP / Computation & Language research 16d ago

Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

arXiv:2606.15325v1 Announce Type: new Abstract: Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in…

4
arXiv — NLP / Computation & Language research 16d ago

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

arXiv:2606.15984v1 Announce Type: new Abstract: Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the…

4
arXiv — NLP / Computation & Language research 16d ago

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains…

23
arXiv — NLP / Computation & Language research 16d ago

Scaling Human and G2P Supervision for Robust Phonetic Transcription

arXiv:2606.16019v1 Announce Type: new Abstract: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We…

20
arXiv — NLP / Computation & Language research 16d ago

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

arXiv:2606.16074v1 Announce Type: new Abstract: Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior…

38
arXiv — NLP / Computation & Language research 16d ago

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

arXiv:2606.16137v1 Announce Type: new Abstract: Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution,…

31
arXiv — NLP / Computation & Language research 16d ago

TMASC: Transmasculine Attitude and Speech Corpus

arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the…

25
arXiv — Machine Learning research 17d ago

Beyond task performance: Decoding bioacoustic embeddings with speech features

arXiv:2606.14662v1 Announce Type: new Abstract: Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species…

6
arXiv — NLP / Computation & Language research 17d ago

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior…

15
arXiv — NLP / Computation & Language research 17d ago

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

arXiv:2606.14459v1 Announce Type: new Abstract: Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech…

6
arXiv — NLP / Computation & Language research 17d ago

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

arXiv:2606.14528v1 Announce Type: new Abstract: Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in.…

10
arXiv — NLP / Computation & Language research 17d ago

Multimodal Speaker Identification in Classroom Environments

arXiv:2606.13712v1 Announce Type: cross Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework…

24
arXiv — NLP / Computation & Language research 17d ago

OLaPh: Optimal Language Phonemizer

arXiv:2509.20086v4 Announce Type: replace Abstract: Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary…

8
arXiv — NLP / Computation & Language research 17d ago

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

arXiv:2510.05150v3 Announce Type: replace Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This…

7
r/LocalLLaMA community 17d ago

Gemma 12b less than 10 watts 6.5pp 1.3tg

Google Pixel 10 Pro, Termux, llama.cpp version: 9639 (ef8268fee) $ ./llama.cpp/build_vulkan/bin/llama-cli -m storage/downloads/gemma-4-12b-it-UD-Q3_K_XL.gguf --model-draft storage/downloads/mtp-gemma-4-12b-it.gguf --temp 1.0 --top-p 0.95 --top-k 64 --spec-type draft-mtp…

5
r/LocalLLaMA community 17d ago

Voice-to-voice chatbot update

I've been working on this after hours for a few months, continuously improving it. It is now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B…

33
r/LocalLLaMA community 17d ago

Gemma 4 models benchmarked on a triple-GPU setup

Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 MHz RAM. Nvidia GTX-1070 at 8 GiB VRAM (x3) with 24 GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo…

29
r/LocalLLaMA community 17d ago

Gemma 4 12B native encoder free voice input utilization suggest?

Hey everyone. Like many of you, I’m looking into the newly released Gemma 4 12B to build a native speech-to-speech experience. Because of its unique encoder-free architecture, completely skipping the traditional STT bottleneck could be possible. Right now, my…

20
r/MachineLearning community 18d ago

Confused, where to start [D]

Hello community, I am a backend + big data dev. I want to learn about the LLMs that generate voices. I also read some articles, but almost every one of them starts from regression. There are so many resources available right now that I am confused about where to begin. …

14
r/LocalLLaMA community 19d ago

ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning

https://reddit.com/link/1u4lk5c/video/kyhdw0uog07h1/player Links: Blog: https://zyphra.com/our-work/zonos2; Weights: https://huggingface.co/Zyphra/ZONOS2; Inference code: https://github.com/Zyphra/ZONOS2; Eval code: https://github.com/Zyphra/ZTTS1-Eval. Model TTSDS Prosody Score ↑…

15
arXiv — NLP / Computation & Language research 20d ago

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

arXiv:2606.12902v1 Announce Type: new Abstract: Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while…

10
arXiv — NLP / Computation & Language research 20d ago

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

arXiv:2606.12911v1 Announce Type: new Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST,…

8
arXiv — NLP / Computation & Language research 20d ago

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

arXiv:2606.13121v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive…

17
arXiv — NLP / Computation & Language research 20d ago

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

arXiv:2606.13464v1 Announce Type: new Abstract: Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires…

11
