Voice · 391 articles archived under #voice
Hugging Face Daily Papers research 24d ago Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by… 35 arXiv — Machine Learning research 24d ago Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks arXiv:2606.06833v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal… 34 arXiv — NLP / Computation & Language research 24d ago Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition arXiv:2606.06985v1 Announce Type: new Abstract: Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive… 28 arXiv — NLP / Computation & Language research 24d ago KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026 arXiv:2606.07240v1 Announce Type: new Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. 
This task is central to speech translation and is the focus of the IWSLT 2026… 21 arXiv — NLP / Computation & Language research 24d ago Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations arXiv:2606.06740v1 Announce Type: cross Abstract: Discrete speech units obtained via k-means clustering of self supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech… 22 arXiv — NLP / Computation & Language research 24d ago HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec arXiv:2606.06743v1 Announce Type: cross Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main… 21 arXiv — NLP / Computation & Language research 24d ago Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question… 14 arXiv — NLP / Computation & Language research 24d ago The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders? arXiv:2606.07435v1 Announce Type: cross Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI… 28 r/LocalLLaMA community 24d ago Best Local TTS solution So I have been testing a bunch of different solutions for local TTS - nothing so far comes close to elevenlabs for dynamic ability, voices, cloning. I’d like to have a phone-compatible setup. 
So far the best I can find for edge devices is moss-nano and kokoro. Free/cloud so far… 25 Hugging Face Daily Papers research 24d ago dots.tts Technical Report Abstract A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques. Generated by… 32 r/LocalLLaMA community 25d ago Dockerized Nemotron 3.5 ASR — Switched from Parakeet, better multilingual support + streaming (4.5x realtime speed on cpu) I was originally using Parakeet for my speech recognition pipeline but decided to give Nemotron 3.5 a shot. After testing it on some multilingual audio clips, it's been working great so far. What sold me: - Better language support (40+ locales from one model) - Native streaming… 17 r/LocalLLaMA community 26d ago Serving TTS/cloning models on llama.cpp? Are there any quality voice cloning and speech generation models that already have support in Llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather making a separate container or conda for each model I… 17 r/LocalLLaMA community 26d ago dots.tts 2B🎙️ SOTA TTS from RedNote 🔗 Blog: https://rednote-hilab.github.io/dots.tts-demo/ 🔗 GitHub: https://github.com/rednote-hilab/dots.tts 🔗 Technical Report: https://arxiv.org/abs/2608.16894 dots.tts 🎙️ New open-source TTS from RedNote (Xiaohongshu) ✨ 2B parameters (Apache 2.0) ✨ Fully continuous… 16 Hugging Face Daily Papers research 27d ago Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs Abstract Code-switching automatic speech recognition models show limited generalization across unseen language pairs despite attempts at model merging and domain generalization techniques. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Automatic Speech Recognition (ASR) has become… 35 arXiv — NLP / Computation & Language research 27d ago Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems arXiv:2606.05179v1 Announce Type: new Abstract: Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which… 21 arXiv — NLP / Computation & Language research 27d ago Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach arXiv:2606.05545v1 Announce Type: new Abstract: The development of multilingual Alzheimer's Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel… 35 arXiv — NLP / Computation & Language research 27d ago InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization arXiv:2606.05561v1 Announce Type: new Abstract: Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve… 34 arXiv — NLP / Computation & Language research 27d ago Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs arXiv:2606.05569v1 Announce Type: new Abstract: Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. 
In this paper, we propose a method for constructing statistical graphs… 7 arXiv — NLP / Computation & Language research 27d ago Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs arXiv:2606.05846v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across… 8 arXiv — NLP / Computation & Language research 27d ago To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.… 21 arXiv — NLP / Computation & Language research 27d ago Automatic Labelling of Speech Translation Errors arXiv:2606.06047v1 Announce Type: new Abstract: Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech… 21 arXiv — NLP / Computation & Language research 27d ago Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition arXiv:2606.06065v1 Announce Type: new Abstract: Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. 
Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs.… 9 arXiv — NLP / Computation & Language research 27d ago Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an… 13 arXiv — NLP / Computation & Language research 27d ago FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition arXiv:2606.06211v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear… 18 arXiv — NLP / Computation & Language research 27d ago From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation arXiv:2606.06266v1 Announce Type: new Abstract: Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale.… 22 Hugging Face Daily Papers research 27d ago SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing Abstract A bilingual multi-attribute benchmark for instruction-guided speech editing is introduced to systematically evaluate speech modification capabilities across atomic and compositional tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Instruction-guided speech editing… 16 r/LocalLLaMA community 27d ago Higgs Audio v3 TTS 4B. Built for voice chat. Support 100 languages and inline control.   
submitted by /u/FerretLegitimate6929 31 Hugging Face official-blog 28d ago How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent Published June 4, 2026 Maryam Motamedi maryameee nvidia Adi- margolin Amargolin nvidia Francesco fciannella nvidia Myungjong Kim Myungjong nvidia Enas Albasiri… 4 arXiv — Machine Learning research 28d ago Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers arXiv:2606.04678v1 Announce Type: new Abstract: End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse… 4 arXiv — NLP / Computation & Language research 28d ago Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention arXiv:2606.04474v1 Announce Type: new Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T)… 37 arXiv — NLP / Computation & Language research 28d ago Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. 
We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing… 18 arXiv — NLP / Computation & Language research 28d ago Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026 arXiv:2606.04730v1 Announce Type: new Abstract: With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is… 7 arXiv — NLP / Computation & Language research 28d ago CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding arXiv:2606.04418v1 Announce Type: cross Abstract: Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency,… 6 Hugging Face Daily Papers research 28d ago OpenSTBench: Beyond Semantic Evaluation for Speech Translation Abstract OpenSTBench presents a unified evaluation framework for speech translation systems that assesses multiple dimensions including translation quality, speech quality, and temporal consistency across different modalities and settings. Generated by… 15 TechCrunch — AI news-outlet 29d ago These two founders left Goldman and Meta to build voice AI for markets everyone else overlooked The startup's own stack for Africa and Middle East is now handling more than 17,000 calls per day. 21 arXiv — Machine Learning research 29d ago CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning arXiv:2606.02998v1 Announce Type: new Abstract: Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. 
A practical tool needs to tell apart several respiratory conditions from one cough recording on a… 4 arXiv — NLP / Computation & Language research 29d ago Benchmarking Speech-to-Speech Translation Models arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and… 5 arXiv — NLP / Computation & Language research 29d ago BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language arXiv:2606.03504v1 Announce Type: new Abstract: We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated… 4 arXiv — NLP / Computation & Language research 29d ago A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026 arXiv:2606.03948v1 Announce Type: new Abstract: We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to IWSLT 2026 Simultaneous Speech Translation Shared task… 16 arXiv — NLP / Computation & Language research 29d ago Efficient ASR Training with Conversations that Never Happened arXiv:2606.03957v1 Announce Type: new Abstract: Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. 
We propose an augmentation pipeline that generates scenario-level dialogues with… 21 arXiv — NLP / Computation & Language research 29d ago AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated… 16 Hugging Face Daily Papers research 29d ago Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures Abstract Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Learning… 17 TechCrunch — AI news-outlet 29d ago Martin Scorsese becomes the latest — and most unlikely — Hollywood voice for AI The caveat is that one of the world's most famous living directors is using the tech solely for storyboarding. 38 The Information — AI news-outlet 29d ago 5 Ways Companies Keep AI Bills in Check Snowflake CEO Sridhar Ramaswamy on Monday became the latest executive to voice concerns over rising AI costs . “Are we worried about how much we are spending on AI inference across our internal teams? Absolutely,” he told my colleague Laura during Snowflake’s annual conference… 23 Smol AI News news-outlet 1mo ago Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows **Microsoft** introduced **MAI-Thinking-1**, a **35B parameter MoE model** with **256K context**, achieving **97% on AIME 2025** and outperforming **Sonnet 4.6** in human preference tests. The broader **7-model MAI family** spans reasoning, code, image, speech, and voice, with… 37 r/LocalLLaMA community 1mo ago Moss tts 1.5 8b Examples. 
It is the currently best voice cloning model for English as of June 2026 Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts voice clone tts. You can easily get more better quality if you set up the duration of the voice in output you want and some temperature and other changes. This was just used on default setting. It can be improved… 20 arXiv — NLP / Computation & Language research 1mo ago SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors arXiv:2606.00460v1 Announce Type: new Abstract: Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering… 23 arXiv — NLP / Computation & Language research 1mo ago LaSR: Context-Aware Speech Recognition via Latent Reasoning arXiv:2606.00507v1 Announce Type: new Abstract: Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, struggling to perform speech recognition that… 4 arXiv — NLP / Computation & Language research 1mo ago PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects arXiv:2606.01016v1 Announce Type: new Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a… 19 arXiv — NLP / Computation & Language research 1mo ago Child-directed speech facilitates production, not comprehension, in BabyLMs arXiv:2606.01045v1 Announce Type: new Abstract: Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs. 
However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of… 26