Tag

Voice

389 articles archived under #voice · RSS

arXiv — NLP / Computation & Language research 6d ago

FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following

arXiv:2606.26819v1 Announce Type: new Abstract: This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLMs are developed for both short-form and long-form speech instruction following under constrained settings. For the short track,…

14
arXiv — NLP / Computation & Language research 6d ago

SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages

arXiv:2606.26901v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare context remains largely unknown. In this study, we first…

6
arXiv — NLP / Computation & Language research 6d ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting…

35
arXiv — NLP / Computation & Language research 6d ago

Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech

arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While…

36
r/LocalLLaMA community 6d ago

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

I’ve been working on audio.cpp , a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything…

24
Hugging Face Daily Papers research 6d ago

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

Abstract: A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance. Generated by…

30
r/MachineLearning community 6d ago

Looking for arXiv endorsement (eess.AS or cs.SD) [R]

Hi, I'm an undergrad researcher looking for an arXiv endorsement to submit my first paper in the audio/speech processing domain (keyword spotting on microcontrollers). I've submitted to a peer-reviewed IEEE conference and am awaiting results, but want to get a preprint up in the…

26
r/LocalLLaMA community 7d ago

Has anyone tried to hack into their own system using a local model?

With all this talk about Mythos being able to hack into US government systems, I was wondering if anyone has tried to get root on their own system using a local model? submitted by /u/MrMrsPotts

18
arXiv — NLP / Computation & Language research 7d ago

Graph-Based Phonetic Error Correction of Noisy ASR

arXiv:2606.24889v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing…

37
arXiv — NLP / Computation & Language research 7d ago

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

arXiv:2606.24915v1 Announce Type: new Abstract: End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages. While retrieval-augmented generation frameworks can mitigate these errors using…

18
arXiv — NLP / Computation & Language research 7d ago

Probing in the Wild: A Case Study of Self-Supervised Speech Representations on Mandarin Sub-dialects with Unsupervised Articulatory Analysis

arXiv:2606.25459v1 Announce Type: new Abstract: While self-supervised speech models have achieved strong performance across speech tasks, relatively little is known about how their internal phonetic representations behave under fine-grained dialect variation. Existing probing…

11
arXiv — NLP / Computation & Language research 7d ago

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat…

23
arXiv — NLP / Computation & Language research 7d ago

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations…

29
arXiv — NLP / Computation & Language research 7d ago

Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect

arXiv:2606.26003v1 Announce Type: new Abstract: Automatic speech and language technologies are still heavily biased toward high-resource languages, limiting their applicability to dialectal and low-resource settings such as Algerian Dialect. This language presents additional…

28
arXiv — NLP / Computation & Language research 7d ago

Real-Time Voice AI Hears but Does Not Listen

arXiv:2606.26083v1 Announce Type: new Abstract: Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on…

34
arXiv — NLP / Computation & Language research 7d ago

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

arXiv:2606.25369v1 Announce Type: cross Abstract: While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique…

36
arXiv — NLP / Computation & Language research 7d ago

Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

arXiv:2606.25424v1 Announce Type: cross Abstract: Diffusion-based text-to-speech (TTS) models have achieved significant improvements in speech quality. However, modeling sharp prosodic transitions and rapid pitch variations in expressive speech remains challenging. Existing…

37
arXiv — NLP / Computation & Language research 7d ago

Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models

arXiv:2606.25436v1 Announce Type: cross Abstract: Dialogue systems based on large language models (LLMs) have advanced significantly in recent years. However, dialectal variation remains a major challenge, particularly for systems that process spoken input. LLM-based speech…

34
arXiv — NLP / Computation & Language research 7d ago

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on…

23
arXiv — NLP / Computation & Language research 7d ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced…

24
arXiv — NLP / Computation & Language research 7d ago

Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme-Based Analysis of Climate Discourse

arXiv:2601.13317v2 Announce Type: replace Abstract: Climate discourse online shapes public understanding of climate change and informs political and policy debate, yet it unfolds across structurally different environments: paid advertising platforms host targeted,…

9
Hacker News — AI on Front Page community 7d ago

Founding a company in Germany: €9600, 152 days and I still can't send an invoice

Article URL: https://paolino.me/founding-a-company-in-germany/ Comments URL: https://news.ycombinator.com/item?id=48658718 Points: 282 # Comments: 334

10
r/LocalLLaMA community 8d ago

llama.cpp updates - granite-speech-4.1-2b, LFM2.5-ColBERT/Embedding-350M, Vulkan backend related changes & Misc items

Supported Models : granite-speech-4.1-2b-plus by 24818 LFM2.5-ColBERT-350M & LFM2.5-Embedding-350M by 24913 Vulkan : vulkan: link ggml-cpu when GGML_VULKAN_CHECK_RESULTS / RUN_TESTS are enabled #24444 vulkan: make mul_mm ALIGNED a spec constant #24689 vulkan: support CONV_3D…

27
arXiv — Machine Learning research 8d ago

NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction

arXiv:2606.24087v1 Announce Type: new Abstract: Reconstructing continuous speech from scalp electroencephalography (EEG) remains fundamentally challenging. EEG provides a weak, spatially diffuse, and highly variable measurement of distributed cortical activity, whereas speech is…

9
arXiv — NLP / Computation & Language research 8d ago

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

arXiv:2606.23948v1 Announce Type: new Abstract: Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is…

6
arXiv — NLP / Computation & Language research 8d ago

Automatic Part-of-Speech Tagging of Arabic-English Dictionary Senses through WordNet

arXiv:2606.24359v1 Announce Type: new Abstract: This paper proposed an algorithm for part-of-speech (POS) tagging senses of a bilingual dictionary. The algorithm is applied on the Al-Mawrid Arabic-English dictionary. The tagging task is accomplished by transferring the POS tags…

21
arXiv — NLP / Computation & Language research 8d ago

Measuring User's Mental Models of Speech Translation in Human-AI Collaboration

arXiv:2606.24644v1 Announce Type: new Abstract: Millions of people use machine translation (MT) tools daily, yet little is known about their perception of what systems can and cannot do. This paper studies users' mental models of speech translation systems through a new…

13
arXiv — NLP / Computation & Language research 8d ago

CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

arXiv:2606.24714v1 Announce Type: new Abstract: Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening…

33
arXiv — NLP / Computation & Language research 8d ago

L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models

arXiv:2606.24825v1 Announce Type: new Abstract: Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most…

16
arXiv — NLP / Computation & Language research 8d ago

Progressive Alignment Objectives for Aligner-Encoder based ASR

arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without…

23
arXiv — NLP / Computation & Language research 8d ago

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained…

15
Hugging Face official-blog 8d ago

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Published June 24, 2026. Authors: Daniel Gert Nielsen (daniel-treble, treble-technologies), Shivam Saini (whojavumusic, treble-technologies), Alessia Milo (alessia-treble…

11
r/LocalLLaMA community 8d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't…

19
llama.cpp releases dev-tools 8d ago

b9768

model: Granite Speech Plus ( #24818 ) feat: Add conversion support for Granite Speech Plus Branch: GraniteSpeechPlus AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart ghart@us.ibm.com feat: Extend granite_speech to support plus multi-layer concatenation…

27
r/MachineLearning community 9d ago

Recommendations for speech annotation tools [D]

I'm looking for human-in-the-loop platforms that let you automatically transcribe audio, manually fix the transcriptions, and then fine-tune the model. Is there a locally installable platform (not an online service) for doing this?

11
r/MachineLearning community 10d ago

Best current methods for finetuning whisper on domain specific vocabulary? [P]

Hey everyone, I’m wondering whether there are any newer or more effective methods for fine-tuning Whisper on domain-specific speech. I’m working on a project where the model needs to reliably detect certain specific words and technical terms. The vocabulary and context are…

4
Hacker News — AI on Front Page community 12d ago

A new bill takes aim at government pressure to silence lawful online speech

Article URL: https://www.eff.org/deeplinks/2026/06/new-bill-takes-aim-government-pressure-silence-lawful-online-speech Comments URL: https://news.ycombinator.com/item?id=48600950 Points: 205 # Comments: 111

27
r/LocalLLaMA community 12d ago

How do you guys setup search with your AI models?

Been selfhosting my models for a while and I'd really like to integrate Gemma 4 12B as a simple voice assistant with search capabilities. I've tried using openwebui but the search is kind of broken with DDG and I really don't want to use API keys from Brave or Google etc. So…

25
r/LocalLLaMA community 12d ago

Watching a local AI voice assistant get dumber (A 9B to 0.8B agent experiment on my RTX 5060 Ti)

I wanted to find the exact floor for running an intelligent, local voice assistant agent on consumer hardware. I kept the environment, tools, and prompts identical and stepped the model sizes down through Qwen 3.5 9B, 4B, 2B, and 0.8B to see how agentic reasoning degrades. The…

12
Hugging Face Daily Papers research 12d ago

Duration Aware Scheduling for ASR Serving Under Workload Drift

Abstract: Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by…

26
arXiv — Machine Learning research 13d ago

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for…

35
arXiv — NLP / Computation & Language research 13d ago

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

arXiv:2606.19354v1 Announce Type: new Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the…

5
arXiv — NLP / Computation & Language research 13d ago

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

arXiv:2606.19591v1 Announce Type: new Abstract: In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to…

9
arXiv — NLP / Computation & Language research 13d ago

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

arXiv:2606.19910v1 Announce Type: new Abstract: Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised…

31
arXiv — NLP / Computation & Language research 13d ago

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

arXiv:2606.20179v1 Announce Type: new Abstract: Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial…

21
arXiv — NLP / Computation & Language research 13d ago

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

arXiv:2606.20369v1 Announce Type: new Abstract: Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats,…

5
arXiv — NLP / Computation & Language research 13d ago

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

arXiv:2606.19951v1 Announce Type: cross Abstract: Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via…

34
arXiv — NLP / Computation & Language research 13d ago

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

arXiv:2606.19996v1 Announce Type: cross Abstract: Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset…

25
arXiv — NLP / Computation & Language research 13d ago

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

arXiv:2606.20137v1 Announce Type: cross Abstract: Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA),…

35
arXiv — NLP / Computation & Language research 13d ago

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual…

7
