Tag Voice 389 articles archived under #voice r/LocalLLaMA community 13d ago My suitcase robot gets high now off a real gas sensor wired straight into the LLM sampler. Smoke raises temperature/top_p/top_k live, so his speech genuinely gets loopier and never repeats. Follow-up on Sparky, my offline suitcase robot I keep overdeveloping. He gets high now, and there's no scripted "stoned mode" anywhere in it. A real MQ-2 gas sensor sits in the case. Every 0.5s I read it against an adaptive clean-air baseline and turn a smoke hit into a 0 to 10… 30 r/MachineLearning community 13d ago Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D] I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with… 25 arXiv — NLP / Computation & Language research 14d ago Fair Cognitive Impairment Detection Through Unlearning arXiv:2606.18571v1 Announce Type: cross Abstract: Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned… 33 arXiv — NLP / Computation & Language research 14d ago Continuous Audio Thinking for Large Audio Language Models arXiv:2606.18273v1 Announce Type: new Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. 
However, because LALMs are typically trained to produce text-aligned… 37 arXiv — NLP / Computation & Language research 14d ago Montreal Forced Aligner and the state of speech-to-text alignment in 2026 arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded… 5 arXiv — NLP / Computation & Language research 14d ago Speech-Driven End-to-End Language Discrimination towards Chinese Dialects arXiv:2606.18584v1 Announce Type: new Abstract: Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of… 12 arXiv — NLP / Computation & Language research 14d ago Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining arXiv:2606.18852v1 Announce Type: new Abstract: Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface… 35 arXiv — NLP / Computation & Language research 14d ago Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. 
Classical cascade models that do not explicitly represent the profile, community, and content factors… 8 arXiv — NLP / Computation & Language research 14d ago Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment arXiv:2606.18979v1 Announce Type: cross Abstract: Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but… 36 arXiv — NLP / Computation & Language research 14d ago IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge… 35 arXiv — NLP / Computation & Language research 14d ago Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech arXiv:2506.12311v3 Announce Type: replace Abstract: Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. 
We present a framework for more phonetically… 38 arXiv — NLP / Computation & Language research 14d ago TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving arXiv:2508.07375v3 Announce Type: replace Abstract: Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and… 20 arXiv — NLP / Computation & Language research 14d ago UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition arXiv:2509.14653v2 Announce Type: replace Abstract: This paper proposes a unimodal aggregation (UMA) based nonautoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates acoustic frames (with unimodal weights that… 28 MIT News — AI research 14d ago MIT in the media: For the future of tech, "Massachusetts can absolutely lead" Leaders, faculty across MIT discuss fostering innovation and talent in Greater Boston in special series of articles published alongside the outlet's annual list of 'Tech Power Players' 27 r/LocalLLaMA community 14d ago I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. I’ve been experimenting with how small a usable neural TTS model can realistically get, and I just released Inflect-Nano-v1. As far as I researched (though I could be wrong on this), Inflect-Nano-v1 is the #2 smallest TTS model publicly released (after TinyTTS), and it… 24 arXiv — NLP / Computation & Language research 15d ago MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task arXiv:2606.17255v1 Announce Type: new Abstract: This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. 
Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create… 20 arXiv — NLP / Computation & Language research 15d ago Are you speaking my languages? On spoken language adherence in multimodal LLMs arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To… 9 arXiv — NLP / Computation & Language research 15d ago Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation arXiv:2606.17820v1 Announce Type: new Abstract: This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of… 28 arXiv — NLP / Computation & Language research 15d ago When Multiple Scripts Matter: Evaluating ASR in Clinical Settings arXiv:2606.17826v1 Announce Type: new Abstract: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics… 20 arXiv — NLP / Computation & Language research 15d ago Perceptual compensation for tonal context in self-supervised speech models arXiv:2606.17835v1 Announce Type: new Abstract: This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. 
We conducted a pseudo-replication of a perceptional compensation experiment on Mandarin Chinese tones,… 30 arXiv — NLP / Computation & Language research 15d ago Learning task-specific subspaces via interventional post-training of speech foundation models arXiv:2606.17967v1 Announce Type: new Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech… 5 arXiv — NLP / Computation & Language research 15d ago SpeechDx: A Multi-Task Benchmark for Clinical Speech AI arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated… 15 arXiv — NLP / Computation & Language research 15d ago Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition arXiv:2606.17537v1 Announce Type: cross Abstract: Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is… 30 arXiv — NLP / Computation & Language research 15d ago ALAS: An Automatic Latent Alignment Score for Audio Language Models arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. 
Yet despite a growth of fusion… 17 arXiv — NLP / Computation & Language research 16d ago A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior… 7 arXiv — NLP / Computation & Language research 16d ago Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation arXiv:2606.15266v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains… 16 arXiv — NLP / Computation & Language research 16d ago Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback arXiv:2606.15325v1 Announce Type: new Abstract: Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in… 4 arXiv — NLP / Computation & Language research 16d ago ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition arXiv:2606.15984v1 Announce Type: new Abstract: Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. 
This paper introduces the… 4 arXiv — NLP / Computation & Language research 16d ago Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains… 23 arXiv — NLP / Computation & Language research 16d ago Scaling Human and G2P Supervision for Robust Phonetic Transcription arXiv:2606.16019v1 Announce Type: new Abstract: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We… 20 arXiv — NLP / Computation & Language research 16d ago PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization arXiv:2606.16074v1 Announce Type: new Abstract: Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior… 38 arXiv — NLP / Computation & Language research 16d ago XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models arXiv:2606.16137v1 Announce Type: new Abstract: Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. 
Traditional explainable AI (XAI), such as gradient-based attribution,… 31 arXiv — NLP / Computation & Language research 16d ago TMASC: Transmasculine Attitude and Speech Corpus arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the… 25 arXiv — Machine Learning research 17d ago Beyond task performance: Decoding bioacoustic embeddings with speech features arXiv:2606.14662v1 Announce Type: new Abstract: Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species… 6 arXiv — NLP / Computation & Language research 17d ago Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. 
Prior… 15 arXiv — NLP / Computation & Language research 17d ago MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition arXiv:2606.14459v1 Announce Type: new Abstract: Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech… 6 arXiv — NLP / Computation & Language research 17d ago BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM arXiv:2606.14528v1 Announce Type: new Abstract: Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in.… 10 arXiv — NLP / Computation & Language research 17d ago Multimodal Speaker Identification in Classroom Environments arXiv:2606.13712v1 Announce Type: cross Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework… 24 arXiv — NLP / Computation & Language research 17d ago OLaPh: Optimal Language Phonemizer arXiv:2509.20086v4 Announce Type: replace Abstract: Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary… 8 arXiv — NLP / Computation & Language research 17d ago Chronological Thinking in Full-Duplex Spoken Dialogue Language Models arXiv:2510.05150v3 Announce Type: replace Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. 
This… 7 r/LocalLLaMA community 17d ago Gemma 12b less than 10 watts 6.5pp 1.3tg Google pixel 10 pro Termux Llamacpp version: 9639 (ef8268fee) $ ./llama.cpp/build_vulkan/bin/llama-cli -m storage/downloads/gemma-4-12b-it-UD-Q3_K_XL.gguf --model-draft storage/downloads/mtp-gemma-4-12b-it.gguf --temp 1.0 --top-p 0.95 --top-k 64 --spec-type draft-mtp… 5 r/LocalLLaMA community 17d ago Voice-to-voice chatbot update I've been working on this after hours for a few months continuously improving it. Now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B… 33 r/LocalLLaMA community 17d ago Gemma 4 models benchmarked on with Triple GPU Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 Mhz RAM. Nvidia GTX-1070 at 8GiB VRAM ( X 3 ) with 24GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo… 29 r/LocalLLaMA community 17d ago Gemma 4 12B native encoder free voice input utilization suggest? Hey everyone, Like many of you, I’m looking into the newly released Gemma 4 12B to build a native speech-to-speech experience. Because of its unique encoder-free architecture, completely skipping the traditional STT bottleneck could be possible. Right now, my… 20 r/MachineLearning community 18d ago Confused, where to start [D] Hello community, I am a backend + big data dev. I want to learn about the llms that generate voices. I also read some articles but almost everyone of them starts from regression. There are so much resources available right now that I am now confused where to begin with.
… 14 r/LocalLLaMA community 19d ago ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning https://reddit.com/link/1u4lk5c/video/kyhdw0uog07h1/player Links: Blog: https://zyphra.com/our-work/zonos2 Weights: https://huggingface.co/Zyphra/ZONOS2 Inference code: https://github.com/Zyphra/ZONOS2 Eval code: https://github.com/Zyphra/ZTTS1-Eval Model TTSDS Prosody Score ↑… 15 arXiv — NLP / Computation & Language research 20d ago PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue arXiv:2606.12902v1 Announce Type: new Abstract: Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while… 10 arXiv — NLP / Computation & Language research 20d ago PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation arXiv:2606.12911v1 Announce Type: new Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST,… 8 arXiv — NLP / Computation & Language research 20d ago NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation arXiv:2606.13121v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive… 17 arXiv — NLP / Computation & Language research 20d ago Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations arXiv:2606.13464v1 Announce Type: new Abstract: Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. 
However, as text and speech become increasingly interleaved in long interactions, ASR correction requires… 11