Tag

Voice

391 articles archived under #voice

Hugging Face Daily Papers research 24d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
arXiv — Machine Learning research 24d ago

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

arXiv:2606.06833v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal…

34
arXiv — NLP / Computation & Language research 24d ago

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

arXiv:2606.06985v1 Announce Type: new Abstract: Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive…

28
arXiv — NLP / Computation & Language research 24d ago

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

arXiv:2606.07240v1 Announce Type: new Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026…

21
arXiv — NLP / Computation & Language research 24d ago

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

arXiv:2606.06740v1 Announce Type: cross Abstract: Discrete speech units obtained via k-means clustering of self supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech…

22
arXiv — NLP / Computation & Language research 24d ago

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

arXiv:2606.06743v1 Announce Type: cross Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main…

21
arXiv — NLP / Computation & Language research 24d ago

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question…

14
arXiv — NLP / Computation & Language research 24d ago

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

arXiv:2606.07435v1 Announce Type: cross Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI…

28
r/LocalLLaMA community 24d ago

Best Local TTS solution

So I have been testing a bunch of different solutions for local TTS - nothing so far comes close to elevenlabs for dynamic ability, voices, cloning. I’d like to have a phone-compatible setup. So far the best I can find for edge devices is moss-nano and kokoro. Free/cloud so far…

25
Hugging Face Daily Papers research 24d ago

dots.tts Technical Report

Abstract A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques. Generated by…

32
r/LocalLLaMA community 25d ago

Dockerized Nemotron 3.5 ASR — Switched from Parakeet, better multilingual support + streaming (4.5x realtime speed on cpu)

I was originally using Parakeet for my speech recognition pipeline but decided to give Nemotron 3.5 a shot. After testing it on some multilingual audio clips, it's been working great so far. What sold me: - Better language support (40+ locales from one model) - Native streaming…

17
r/LocalLLaMA community 26d ago

Serving TTS/cloning models on llama.cpp?

Are there any quality voice cloning and speech generation models that already have support in Llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather than making a separate container or conda for each model I…

17
r/LocalLLaMA community 26d ago

dots.tts 2B🎙️ SOTA TTS from RedNote

🔗 Blog: https://rednote-hilab.github.io/dots.tts-demo/ 🔗 GitHub: https://github.com/rednote-hilab/dots.tts 🔗 Technical Report: https://arxiv.org/abs/2608.16894 dots.tts 🎙️ New open-source TTS from RedNote (Xiaohongshu) ✨ 2B parameters (Apache 2.0) ✨ Fully continuous…

16
Hugging Face Daily Papers research 27d ago

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Abstract Code-switching automatic speech recognition models show limited generalization across unseen language pairs despite attempts at model merging and domain generalization techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Automatic Speech Recognition (ASR) has become…

35
arXiv — NLP / Computation & Language research 27d ago

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

arXiv:2606.05179v1 Announce Type: new Abstract: Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which…

21
arXiv — NLP / Computation & Language research 27d ago

Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach

arXiv:2606.05545v1 Announce Type: new Abstract: The development of multilingual Alzheimer's Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel…

35
arXiv — NLP / Computation & Language research 27d ago

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

arXiv:2606.05561v1 Announce Type: new Abstract: Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve…

34
arXiv — NLP / Computation & Language research 27d ago

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

arXiv:2606.05569v1 Announce Type: new Abstract: Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs…

7
arXiv — NLP / Computation & Language research 27d ago

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

arXiv:2606.05846v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across…

8
arXiv — NLP / Computation & Language research 27d ago

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.…

21
arXiv — NLP / Computation & Language research 27d ago

Automatic Labelling of Speech Translation Errors

arXiv:2606.06047v1 Announce Type: new Abstract: Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech…

21
arXiv — NLP / Computation & Language research 27d ago

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

arXiv:2606.06065v1 Announce Type: new Abstract: Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs.…

9
arXiv — NLP / Computation & Language research 27d ago

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an…

13
arXiv — NLP / Computation & Language research 27d ago

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

arXiv:2606.06211v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear…

18
arXiv — NLP / Computation & Language research 27d ago

From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation

arXiv:2606.06266v1 Announce Type: new Abstract: Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale.…

22
Hugging Face Daily Papers research 27d ago

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Abstract A bilingual multi-attribute benchmark for instruction-guided speech editing is introduced to systematically evaluate speech modification capabilities across atomic and compositional tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Instruction-guided speech editing…

16
r/LocalLLaMA community 27d ago

Higgs Audio v3 TTS 4B. Built for voice chat. Support 100 languages and inline control.

submitted by /u/FerretLegitimate6929

31
Hugging Face official-blog 28d ago

How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

Published June 4, 2026. By Maryam Motamedi (maryameee), Adi Margolin (Amargolin), Francesco (fciannella), Myungjong Kim (Myungjong), Enas Albasiri… (NVIDIA)

4
arXiv — Machine Learning research 28d ago

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

arXiv:2606.04678v1 Announce Type: new Abstract: End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse…

4
arXiv — NLP / Computation & Language research 28d ago

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

arXiv:2606.04474v1 Announce Type: new Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T)…

37
arXiv — NLP / Computation & Language research 28d ago

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing…

18
arXiv — NLP / Computation & Language research 28d ago

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

arXiv:2606.04730v1 Announce Type: new Abstract: With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is…

7
arXiv — NLP / Computation & Language research 28d ago

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

arXiv:2606.04418v1 Announce Type: cross Abstract: Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency,…

6
Hugging Face Daily Papers research 28d ago

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Abstract OpenSTBench presents a unified evaluation framework for speech translation systems that assesses multiple dimensions including translation quality, speech quality, and temporal consistency across different modalities and settings. Generated by…

15
TechCrunch — AI news-outlet 29d ago

These two founders left Goldman and Meta to build voice AI for markets everyone else overlooked

The startup's own stack for Africa and Middle East is now handling more than 17,000 calls per day.

21
arXiv — Machine Learning research 29d ago

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

arXiv:2606.02998v1 Announce Type: new Abstract: Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a…

4
arXiv — NLP / Computation & Language research 29d ago

Benchmarking Speech-to-Speech Translation Models

arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and…

5
arXiv — NLP / Computation & Language research 29d ago

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

arXiv:2606.03504v1 Announce Type: new Abstract: We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated…

4
arXiv — NLP / Computation & Language research 29d ago

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

arXiv:2606.03948v1 Announce Type: new Abstract: We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to IWSLT 2026 Simultaneous Speech Translation Shared task…

16
arXiv — NLP / Computation & Language research 29d ago

Efficient ASR Training with Conversations that Never Happened

arXiv:2606.03957v1 Announce Type: new Abstract: Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with…

21
arXiv — NLP / Computation & Language research 29d ago

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated…

16
Hugging Face Daily Papers research 29d ago

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Learning…

17
TechCrunch — AI news-outlet 29d ago

Martin Scorsese becomes the latest — and most unlikely — Hollywood voice for AI

The caveat is that one of the world's most famous living directors is using the tech solely for storyboarding.

38
The Information — AI news-outlet 29d ago

5 Ways Companies Keep AI Bills in Check

Snowflake CEO Sridhar Ramaswamy on Monday became the latest executive to voice concerns over rising AI costs. “Are we worried about how much we are spending on AI inference across our internal teams? Absolutely,” he told my colleague Laura during Snowflake’s annual conference…

23
Smol AI News news-outlet 1mo ago

Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows

Microsoft introduced MAI-Thinking-1, a 35B-parameter MoE model with 256K context, achieving 97% on AIME 2025 and outperforming Sonnet 4.6 in human preference tests. The broader 7-model MAI family spans reasoning, code, image, speech, and voice, with…

37
r/LocalLLaMA community 1mo ago

Moss tts 1.5 8b Examples. It is the currently best voice cloning model for English as of June 2026

Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts voice clone tts. You can easily get better quality if you set the output voice duration you want and adjust the temperature and other settings. These examples just used the default settings. It can be improved…

20
arXiv — NLP / Computation & Language research 1mo ago

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

arXiv:2606.00460v1 Announce Type: new Abstract: Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering…

23
arXiv — NLP / Computation & Language research 1mo ago

LaSR: Context-Aware Speech Recognition via Latent Reasoning

arXiv:2606.00507v1 Announce Type: new Abstract: Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, struggling to perform speech recognition that…

4
arXiv — NLP / Computation & Language research 1mo ago

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

arXiv:2606.01016v1 Announce Type: new Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a…

19
arXiv — NLP / Computation & Language research 1mo ago

Child-directed speech facilitates production, not comprehension, in BabyLMs

arXiv:2606.01045v1 Announce Type: new Abstract: Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of…

26
