Voice · 389 articles archived under #voice
arXiv — Machine Learning research 4h ago Automatic Detection of Stress from Speech in the Trier Social Stress Test arXiv:2607.00986v1 Announce Type: new Abstract: Automatically detecting stress in speech provides an unobtrusive way to gain insights relevant to behavioral research or clinical assessment. This study investigates the automatic differentiation between a stressful and… 12 arXiv — NLP / Computation & Language research 4h ago Hate Speech Detection in Turkish and Arabic Languages: A Comprehensive Study arXiv:2607.00143v1 Announce Type: new Abstract: Online hate speech has been linked to a global rise in violence against minorities, including incidents such as mass shootings, lynchings, and ethnic cleansing. Societies grappling with this issue, particularly when hate speech… 6 arXiv — NLP / Computation & Language research 4h ago Speech Playground: An Interactive Tool for Speech Analysis and Comparison arXiv:2607.00418v1 Announce Type: new Abstract: This paper presents Speech Playground, an interactive speech visualization and comparison tool. While existing tools such as Praat are excellent, it can be cumbersome to integrate them with modern deep learning representations and… 26 arXiv — NLP / Computation & Language research 4h ago Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,… 22 r/LocalLLaMA community 10h ago My reasons to run local models I can finetune any model on any dataset I want. 
I can use techniques like speculative decoding and other sota approaches to get the max tps The llm providers like anthropic and openai are not getting access to my data The hardware is reusable for vision text speech, and I can run… 10 r/LocalLLaMA community 16h ago gemma-4-31B on Cerebras is better than ChatGPT voice mode open models will win on inference too 🚀 submitted by /u/paf1138 34 Hugging Face Daily Papers research 20h ago FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model Abstract Flexible Spoken Language Model (FlexiSLM) introduces dynamic frame rate capabilities for speech input and output, achieving superior performance over fixed-frame-rate models while enabling controllable inference speed. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spoken… 15 Hugging Face Daily Papers research 23h ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages Abstract Multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech-capable models are increasingly deployed in real-world applications across… 36 arXiv — Machine Learning research 1d ago Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection arXiv:2606.30675v1 Announce Type: cross Abstract: Early detection of dementia through speech analysis offers a non-invasive screening alternative, but capturing both acoustic and linguistic biomarkers remains challenging. We propose a multimodal framework leveraging Whisper for… 28 arXiv — NLP / Computation & Language research 1d ago Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. 
Because $F_0$, speaking rate, articulation rate, and pausing shift with… 7 arXiv — NLP / Computation & Language research 1d ago What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR arXiv:2606.31112v1 Announce Type: new Abstract: ASR systems have been often reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including… 31 arXiv — NLP / Computation & Language research 1d ago Gated Multi-Graph Fusion via Graph Attention Networks for Alzheimer's Disease Detection arXiv:2606.31186v1 Announce Type: new Abstract: Spontaneous speech is a vital non-invasive biomarker for Alzheimer's Disease (AD), yet many systems overlook non-linear structural disruptions and clinical heterogeneity in pathological language. We propose a Multi-View Gated Graph… 31 arXiv — NLP / Computation & Language research 1d ago Linguistic Bias Mitigation for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck arXiv:2606.31411v1 Announce Type: new Abstract: Rapid advancements in generative speech technology have compromised the reliability of voice biometrics. While current spoofing detectors excel when assessed under in-domain conditions, generalisation to out-of-domain settings is… 4 arXiv — NLP / Computation & Language research 1d ago Building an ASR Solution for Training and Assessing Children's Reading arXiv:2606.31508v1 Announce Type: new Abstract: Automatic speech recognition for children's reading remains underdeveloped for most African languages, including Bambara, despite its potential value for reproducible literacy assessment. 
We present an open-source system for… 30 arXiv — NLP / Computation & Language research 1d ago Tone-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition arXiv:2606.31642v1 Announce Type: new Abstract: Southern Bantu languages are spoken by over 80 million people, yet current foundation ASR models still produce zero-shot WER above 100%, which limits practical use in education and public services. We addressed this gap with a tone… 18 arXiv — NLP / Computation & Language research 1d ago Adapting Foundation ASR Models to Dysarthric Speech: A Case Study arXiv:2606.31722v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems often perform poorly in dysarthric speech, limiting their usefulness to affected speakers in everyday communication. This paper presents a personalized ASR system for a dysarthric speaker,… 11 arXiv — NLP / Computation & Language research 1d ago LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish arXiv:2606.31947v1 Announce Type: new Abstract: State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we… 25 arXiv — NLP / Computation & Language research 1d ago ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection arXiv:2606.30646v1 Announce Type: cross Abstract: Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia… 18 arXiv — NLP / Computation & Language research 1d ago UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling arXiv:2606.31128v1 Announce Type: cross Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. 
Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion… 30 r/LocalLLaMA community 1d ago [audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml I’m the author of audio.cpp, a C++/ggml runtime for local audio models. I just added VibeVoice 1.5B support and wanted to share the benchmark because long-form multi-speaker TTS is a good stress test for local inference runtimes. Result on RTX 5090: VibeVoice 1.5B Audio length:… 26 Hugging Face official-blog 1d ago Hugging Face and Cerebras bring Gemma 4 to real-time voice AI Published July 1, 2026 Amir Mahla A-Mahla Andres Marafioti andito Leandro von Werra lvwerra Saurabh Vyas vyassaurabh cerebras For voice AI, latency is a critical… 37 Hugging Face Daily Papers research 1d ago One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications Abstract A universal speech enhancement model with configurable algorithmic and computational latency controls using parallel convolutions and early-exit mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Different real-time speech applications impose distinct latency… 9 Hugging Face Daily Papers research 2d ago Interleaved Speech Language Models Latently Work In Text Abstract Interleaved speech-text language models exhibit an implicit transcription phase where text tokens become decodable in intermediate layers, followed by text-based prediction before speech domain transformation. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech language… 16 arXiv — NLP / Computation & Language research 2d ago Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain arXiv:2606.28772v1 Announce Type: new Abstract: Hate speech annotation pipelines routinely collapse annotator disagreement into majority vote labels before training. We show that this aggregation is not neutral: 42.6% of all annotator disagreement in HateXplain concentrates… 28 arXiv — NLP / Computation & Language research 2d ago How to Leverage Synthetic Speech for LLM-Based ASR Systems? arXiv:2606.29031v1 Announce Type: new Abstract: In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text-to-speech (TTS) is an appealing alternative for training automatic… 15 arXiv — NLP / Computation & Language research 2d ago Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs arXiv:2606.29534v1 Announce Type: new Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a… 23 Vercel — AI dev-tools 2d ago Vercel Private Blob is now generally available Vercel Private Blob is now generally available for all plans. Store sensitive files like user-uploaded photos, invoices, and agent memory, and control exactly who can read them. Private stores, Signed URLs, and OIDC authentication all graduate from beta with this release. Vercel… 22 r/MachineLearning community 2d ago I'm trying to implement CALM paper, and I have some questions. [P] Hello, I'm trying to implement the Pocket TTS by kyutai-labs represented by this paper. Since they haven't released the training/fine-tuning code, I'm trying to implement it on my own for learning some stuff. 
I have read the paper, tried to implement it with much more… 34 Vercel — AI dev-tools 3d ago Build realtime voice agents on AI Gateway AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI . Each… 26 arXiv — Machine Learning research 3d ago HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance… 7 arXiv — Machine Learning research 3d ago What Was That Again? Certified Robustness for Automatic Speech Recognition arXiv:2606.27698v1 Announce Type: new Abstract: Automatic Speech Recognition systems are notoriously both sensitive to adversarial and benign perturbations. While this has been repeatedly demonstrated using reference datasets, detecting such behaviors in deployed systems is… 19 arXiv — Machine Learning research 3d ago Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for… 23 arXiv — Machine Learning research 3d ago Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings arXiv:2606.27543v1 Announce Type: cross Abstract: The variations in vocal effort range (e.g. 
whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is… 37 arXiv — NLP / Computation & Language research 3d ago A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges arXiv:2606.27380v1 Announce Type: new Abstract: Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing… 6 arXiv — NLP / Computation & Language research 3d ago Do Speech Emphasis Models Generalize across Languages and Emotions? arXiv:2606.27717v1 Announce Type: new Abstract: Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion… 12 arXiv — NLP / Computation & Language research 3d ago From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection arXiv:2606.27973v1 Announce Type: new Abstract: Speech-based cognitive impairment detection offers a noninvasive, accessible alternative to costly biomarker assays, yet transformer-based models remain clinically uninterpretable. We propose a multi-stage explainability framework… 23 arXiv — NLP / Computation & Language research 3d ago HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech arXiv:2606.28249v1 Announce Type: cross Abstract: Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. 
However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting… 20 arXiv — NLP / Computation & Language research 3d ago Measuring the Redundancy of Decoder Layers in SpeechLLMs arXiv:2603.05121v2 Announce Type: replace Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks.… 36 Hacker News — AI on Front Page community 3d ago Age verification is just a precursor to automated attribution of speech Article URL: https://nonogra.ph/age-verification-is-just-a-precursor-to-attribution-of-speech-06-29-2026 Comments URL: https://news.ycombinator.com/item?id=48714529 Points: 238 # Comments: 105 34 Vercel — AI dev-tools 3d ago Realtime voice, speech, and transcription now supported on AI Gateway AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with… 17 Vercel — AI dev-tools 3d ago xAI Grok audio models now available on Vercel AI Gateway xAI's audio models are now live on AI Gateway. Realtime voice, text to speech, and speech to text are all available through the AI SDK with the same routing, observability, and spend controls as your other models. 
These capabilities are available on the AI SDK 7 release.… 11 Hacker News — AI on Front Page community 3d ago 30-year sentence for transporting zines is a five-alarm fire for free speech Article URL: https://theintercept.com/2026/06/26/daniel-sanchez-estrada-zines-prairieland-free-speech/ Comments URL: https://news.ycombinator.com/item?id=48711981 Points: 200 # Comments: 111 20 r/LocalLLaMA community 3d ago Whisperian: It is one of the best applications for Android, if you want to use Mic with some local ASR models. And it is also available on Play Store. submitted by /u/9r4n4y 29 r/MachineLearning community 4d ago NagaTranslate: Building a translation and voice pipeline for low-resource Nagaland creoles (Whisper, VITS, LLMs) [P] Hello r/MachineLearning, I wanted to share the architecture and challenges behind a project I’ve been building called NagaTranslate. The goal is to build a translation and speech pipeline for the low-resource languages of Nagaland, India (currently supporting Nagamese, Ao, and… 30 r/LocalLLaMA community 4d ago Agentic Cyberdeck Dev I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I… 12 r/LocalLLaMA community 5d ago Are there any qwen finetunes that were genuinely stronger than the base? It's pretty popular to finetune qwen models but I never hear anyone say anything positive about them. submitted by /u/MrMrsPotts 30 r/LocalLLaMA community 5d ago Streaming medical STT running locally on a MacBook Quick teaser of what I’ve been working on over the last few weeks: a streaming medical speech-to-text model that runs fully on-device. This demo is running locally on a MacBook through MLX. Still doing more evals, but planning to release the open weights next week. 
… 22 arXiv — NLP / Computation & Language research 6d ago Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a… 37 arXiv — NLP / Computation & Language research 6d ago AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but… 31 arXiv — NLP / Computation & Language research 6d ago Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean arXiv:2606.26618v1 Announce Type: new Abstract: Large pretrained text-to-speech (TTS) models sound almost human for well-resourced languages, but much worse for languages that are rare in their training data. We study this quality gap for Khmer and Korean using VoxCPM2, a… 26