Tag

Voice

389 articles archived under #voice

arXiv — Machine Learning research 4h ago

Automatic Detection of Stress from Speech in the Trier Social Stress Test

arXiv:2607.00986v1 Announce Type: new Abstract: Automatically detecting stress in speech provides an unobtrusive way to gain insights relevant to behavioral research or clinical assessment. This study investigates the automatic differentiation between a stressful and…

12
arXiv — NLP / Computation & Language research 4h ago

Hate Speech Detection in Turkish and Arabic Languages: A Comprehensive Study

arXiv:2607.00143v1 Announce Type: new Abstract: Online hate speech has been linked to a global rise in violence against minorities, including incidents such as mass shootings, lynchings, and ethnic cleansing. Societies grappling with this issue, particularly when hate speech…

6
arXiv — NLP / Computation & Language research 4h ago

Speech Playground: An Interactive Tool for Speech Analysis and Comparison

arXiv:2607.00418v1 Announce Type: new Abstract: This paper presents Speech Playground, an interactive speech visualization and comparison tool. While existing tools such as Praat are excellent, it can be cumbersome to integrate them with modern deep learning representations and…

26
arXiv — NLP / Computation & Language research 4h ago

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,…

22
r/LocalLLaMA community 10h ago

My reasons to run local models

I can finetune any model on any dataset I want. I can use techniques like speculative decoding and other SOTA approaches to get the max tps. The LLM providers like Anthropic and OpenAI are not getting access to my data. The hardware is reusable for vision, text, speech, and I can run…

10
r/LocalLLaMA community 16h ago

gemma-4-31B on Cerebras is better than ChatGPT voice mode

open models will win on inference too 🚀 submitted by /u/paf1138

34
Hugging Face Daily Papers research 20h ago

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Abstract Flexible Spoken Language Model (FlexiSLM) introduces dynamic frame rate capabilities for speech input and output, achieving superior performance over fixed-frame-rate models while enabling controllable inference speed. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spoken…

15
Hugging Face Daily Papers research 23h ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Abstract Multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech-capable models are increasingly deployed in real-world applications across…

36
arXiv — Machine Learning research 1d ago

Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection

arXiv:2606.30675v1 Announce Type: cross Abstract: Early detection of dementia through speech analysis offers a non-invasive screening alternative, but capturing both acoustic and linguistic biomarkers remains challenging. We propose a multimodal framework leveraging Whisper for…

28
arXiv — NLP / Computation & Language research 1d ago

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with…

7
arXiv — NLP / Computation & Language research 1d ago

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

arXiv:2606.31112v1 Announce Type: new Abstract: ASR systems have been often reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including…

31
arXiv — NLP / Computation & Language research 1d ago

Gated Multi-Graph Fusion via Graph Attention Networks for Alzheimer's Disease Detection

arXiv:2606.31186v1 Announce Type: new Abstract: Spontaneous speech is a vital non-invasive biomarker for Alzheimer's Disease (AD), yet many systems overlook non-linear structural disruptions and clinical heterogeneity in pathological language. We propose a Multi-View Gated Graph…

31
arXiv — NLP / Computation & Language research 1d ago

Linguistic Bias Mitigation for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

arXiv:2606.31411v1 Announce Type: new Abstract: Rapid advancements in generative speech technology have compromised the reliability of voice biometrics. While current spoofing detectors excel when assessed under in-domain conditions, generalisation to out-of-domain settings is…

4
arXiv — NLP / Computation & Language research 1d ago

Building an ASR Solution for Training and Assessing Children's Reading

arXiv:2606.31508v1 Announce Type: new Abstract: Automatic speech recognition for children's reading remains underdeveloped for most African languages, including Bambara, despite its potential value for reproducible literacy assessment. We present an open-source system for…

30
arXiv — NLP / Computation & Language research 1d ago

Tone-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition

arXiv:2606.31642v1 Announce Type: new Abstract: Southern Bantu languages are spoken by over 80 million people, yet current foundation ASR models still produce zero-shot WER above 100%, which limits practical use in education and public services. We addressed this gap with a tone…

18
arXiv — NLP / Computation & Language research 1d ago

Adapting Foundation ASR Models to Dysarthric Speech: A Case Study

arXiv:2606.31722v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems often perform poorly in dysarthric speech, limiting their usefulness to affected speakers in everyday communication. This paper presents a personalized ASR system for a dysarthric speaker,…

11
arXiv — NLP / Computation & Language research 1d ago

LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish

arXiv:2606.31947v1 Announce Type: new Abstract: State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we…

25
arXiv — NLP / Computation & Language research 1d ago

ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection

arXiv:2606.30646v1 Announce Type: cross Abstract: Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia…

18
arXiv — NLP / Computation & Language research 1d ago

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

arXiv:2606.31128v1 Announce Type: cross Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion…

30
r/LocalLLaMA community 1d ago

[audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml

I’m the author of audio.cpp, a C++/ggml runtime for local audio models. I just added VibeVoice 1.5B support and wanted to share the benchmark because long-form multi-speaker TTS is a good stress test for local inference runtimes. Result on RTX 5090: VibeVoice 1.5B Audio length:…

26
Hugging Face official-blog 1d ago

Hugging Face and Cerebras bring Gemma 4 to real-time voice AI

Published July 1, 2026 by Amir Mahla, Andres Marafioti, Leandro von Werra, and Saurabh Vyas. For voice AI, latency is a critical…

37
Hugging Face Daily Papers research 1d ago

One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications

Abstract A universal speech enhancement model with configurable algorithmic and computational latency controls using parallel convolutions and early-exit mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Different real-time speech applications impose distinct latency…

9
Hugging Face Daily Papers research 2d ago

Interleaved Speech Language Models Latently Work In Text

Abstract Interleaved speech-text language models exhibit an implicit transcription phase where text tokens become decodable in intermediate layers, followed by text-based prediction before speech domain transformation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech language…

16
arXiv — NLP / Computation & Language research 2d ago

Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

arXiv:2606.28772v1 Announce Type: new Abstract: Hate speech annotation pipelines routinely collapse annotator disagreement into majority vote labels before training. We show that this aggregation is not neutral: 42.6% of all annotator disagreement in HateXplain concentrates…

28
arXiv — NLP / Computation & Language research 2d ago

How to Leverage Synthetic Speech for LLM-Based ASR Systems?

arXiv:2606.29031v1 Announce Type: new Abstract: In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text-to-speech (TTS) is an appealing alternative for training automatic…

15
arXiv — NLP / Computation & Language research 2d ago

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

arXiv:2606.29534v1 Announce Type: new Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a…

23
Vercel — AI dev-tools 2d ago

Vercel Private Blob is now generally available

Vercel Private Blob is now generally available for all plans. Store sensitive files like user-uploaded photos, invoices, and agent memory, and control exactly who can read them. Private stores, Signed URLs, and OIDC authentication all graduate from beta with this release. Vercel…

22
r/MachineLearning community 2d ago

I'm trying to implement CALM paper, and I have some questions. [P]

Hello, I'm trying to implement Pocket TTS by kyutai-labs, presented in this paper. Since they haven't released the training/fine-tuning code, I'm trying to implement it on my own to learn some stuff. I have read the paper, tried to implement it with much more…

34
Vercel — AI dev-tools 3d ago

Build realtime voice agents on AI Gateway

AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI. Each…

26
arXiv — Machine Learning research 3d ago

HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models

arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance…

7
arXiv — Machine Learning research 3d ago

What Was That Again? Certified Robustness for Automatic Speech Recognition

arXiv:2606.27698v1 Announce Type: new Abstract: Automatic Speech Recognition systems are notoriously sensitive to both adversarial and benign perturbations. While this has been repeatedly demonstrated using reference datasets, detecting such behaviors in deployed systems is…

19
arXiv — Machine Learning research 3d ago

Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition

arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for…

23
arXiv — Machine Learning research 3d ago

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

arXiv:2606.27543v1 Announce Type: cross Abstract: The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is…

37
arXiv — NLP / Computation & Language research 3d ago

A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges

arXiv:2606.27380v1 Announce Type: new Abstract: Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing…

6
arXiv — NLP / Computation & Language research 3d ago

Do Speech Emphasis Models Generalize across Languages and Emotions?

arXiv:2606.27717v1 Announce Type: new Abstract: Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion…

12
arXiv — NLP / Computation & Language research 3d ago

From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection

arXiv:2606.27973v1 Announce Type: new Abstract: Speech-based cognitive impairment detection offers a noninvasive, accessible alternative to costly biomarker assays, yet transformer-based models remain clinically uninterpretable. We propose a multi-stage explainability framework…

23
arXiv — NLP / Computation & Language research 3d ago

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

arXiv:2606.28249v1 Announce Type: cross Abstract: Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting…

20
arXiv — NLP / Computation & Language research 3d ago

Measuring the Redundancy of Decoder Layers in SpeechLLMs

arXiv:2603.05121v2 Announce Type: replace Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks.…

36
Hacker News — AI on Front Page community 3d ago

Age verification is just a precursor to automated attribution of speech

Article URL: https://nonogra.ph/age-verification-is-just-a-precursor-to-attribution-of-speech-06-29-2026 Comments URL: https://news.ycombinator.com/item?id=48714529 Points: 238 # Comments: 105

34
Vercel — AI dev-tools 3d ago

Realtime voice, speech, and transcription now supported on AI Gateway

AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with…

17
Vercel — AI dev-tools 3d ago

xAI Grok audio models now available on Vercel AI Gateway

xAI's audio models are now live on AI Gateway. Realtime voice, text to speech, and speech to text are all available through the AI SDK with the same routing, observability, and spend controls as your other models. These capabilities are available on the AI SDK 7 release.…

11
Hacker News — AI on Front Page community 3d ago

30-year sentence for transporting zines is a five-alarm fire for free speech

Article URL: https://theintercept.com/2026/06/26/daniel-sanchez-estrada-zines-prairieland-free-speech/ Comments URL: https://news.ycombinator.com/item?id=48711981 Points: 200 # Comments: 111

20
r/LocalLLaMA community 3d ago

Whisperian: It is one of the best applications for Android, if you want to use Mic with some local ASR models. And it is also available on Play Store.

submitted by /u/9r4n4y

29
r/MachineLearning community 4d ago

NagaTranslate: Building a translation and voice pipeline for low-resource Nagaland creoles (Whisper, VITS, LLMs) [P]

Hello r/MachineLearning, I wanted to share the architecture and challenges behind a project I’ve been building called NagaTranslate. The goal is to build a translation and speech pipeline for the low-resource languages of Nagaland, India (currently supporting Nagamese, Ao, and…

30
r/LocalLLaMA community 4d ago

Agentic Cyberdeck Dev

I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I…

12
r/LocalLLaMA community 5d ago

Are there any qwen finetunes that were genuinely stronger than the base?

It's pretty popular to finetune qwen models but I never hear anyone say anything positive about them. submitted by /u/MrMrsPotts

30
r/LocalLLaMA community 5d ago

Streaming medical STT running locally on a MacBook

Quick teaser of what I’ve been working on over the last few weeks: a streaming medical speech-to-text model that runs fully on-device. This demo is running locally on a MacBook through MLX. Still doing more evals, but planning to release the open weights next week.  …

22
arXiv — NLP / Computation & Language research 6d ago

Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a…

37
arXiv — NLP / Computation & Language research 6d ago

AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification

arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but…

31
arXiv — NLP / Computation & Language research 6d ago

Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

arXiv:2606.26618v1 Announce Type: new Abstract: Large pretrained text-to-speech (TTS) models sound almost human for well-resourced languages, but much worse for languages that are rare in their training data. We study this quality gap for Khmer and Korean using VoxCPM2, a…

26
