Tag: Voice · 389 articles archived under #voice arXiv — NLP / Computation & Language research 6d ago FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following arXiv:2606.26819v1 Announce Type: new Abstract: This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLMs are developed for both short-form and long-form speech instruction following under constrained settings. For the short track,… 14 arXiv — NLP / Computation & Language research 6d ago SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages arXiv:2606.26901v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare context remains largely unknown. In this study, we first… 6 arXiv — NLP / Computation & Language research 6d ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting… 35 arXiv — NLP / Computation & Language research 6d ago Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While… 36 r/LocalLLaMA community 6d ago audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml.
The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything… 24 Hugging Face Daily Papers research 6d ago Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach Abstract A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance. Generated by… 30 r/MachineLearning community 6d ago Looking for arXiv endorsement (eess.AS or cs.SD) [R] Hi, I'm an undergrad researcher looking for an arXiv endorsement to submit my first paper in the audio/speech processing domain (keyword spotting on microcontrollers). I've submitted to a peer-reviewed IEEE conference and am awaiting results, but want to get a preprint up in the… 26 r/LocalLLaMA community 7d ago Has anyone tried to hack into their own system using a local model? With all this talk about Mythos being able to hack into US government systems, I was wondering if anyone has tried to get root on their own system using a local model? submitted by /u/MrMrsPotts 18 arXiv — NLP / Computation & Language research 7d ago Graph-Based Phonetic Error Correction of Noisy ASR arXiv:2606.24889v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing… 37 arXiv — NLP / Computation & Language research 7d ago Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction arXiv:2606.24915v1 Announce Type: new Abstract: End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages. 
While retrieval-augmented generation frameworks can mitigate these errors using… 18 arXiv — NLP / Computation & Language research 7d ago Probing in the Wild: A Case Study of Self-Supervised Speech Representations on Mandarin Sub-dialects with Unsupervised Articulatory Analysis arXiv:2606.25459v1 Announce Type: new Abstract: While self-supervised speech models have achieved strong performance across speech tasks, relatively little is known about how their internal phonetic representations behave under fine-grained dialect variation. Existing probing… 11 arXiv — NLP / Computation & Language research 7d ago How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat… 23 arXiv — NLP / Computation & Language research 7d ago SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations… 29 arXiv — NLP / Computation & Language research 7d ago Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect arXiv:2606.26003v1 Announce Type: new Abstract: Automatic speech and language technologies are still heavily biased toward high-resource languages, limiting their applicability to dialectal and low-resource settings such as Algerian Dialect. 
This language presents additional… 28 arXiv — NLP / Computation & Language research 7d ago Real-Time Voice AI Hears but Does Not Listen arXiv:2606.26083v1 Announce Type: new Abstract: Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on… 34 arXiv — NLP / Computation & Language research 7d ago Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis arXiv:2606.25369v1 Announce Type: cross Abstract: While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique… 36 arXiv — NLP / Computation & Language research 7d ago Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS arXiv:2606.25424v1 Announce Type: cross Abstract: Diffusion-based text-to-speech (TTS) models have achieved significant improvements in speech quality. However, modeling sharp prosodic transitions and rapid pitch variations in expressive speech remains challenging. Existing… 37 arXiv — NLP / Computation & Language research 7d ago Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models arXiv:2606.25436v1 Announce Type: cross Abstract: Dialogue systems based on large language models (LLMs) have advanced significantly in recent years. However, dialectal variation remains a major challenge, particularly for systems that process spoken input. LLM-based speech… 34 arXiv — NLP / Computation & Language research 7d ago Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs? 
arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on… 23 arXiv — NLP / Computation & Language research 7d ago Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced… 24 arXiv — NLP / Computation & Language research 7d ago Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme-Based Analysis of Climate Discourse arXiv:2601.13317v2 Announce Type: replace Abstract: Climate discourse online shapes public understanding of climate change and informs political and policy debate, yet it unfolds across structurally different environments: paid advertising platforms host targeted,… 9 Hacker News — AI on Front Page community 7d ago Founding a company in Germany: €9600, 152 days and I still can't send an invoice Article URL: https://paolino.me/founding-a-company-in-germany/ Comments URL: https://news.ycombinator.com/item?id=48658718 Points: 282 # Comments: 334 10 r/LocalLLaMA community 8d ago llama.cpp updates - granite-speech-4.1-2b, LFM2.5-ColBERT/Embedding-350M, Vulkan backend related changes & Misc items Supported Models : granite-speech-4.1-2b-plus by 24818 LFM2.5-ColBERT-350M & LFM2.5-Embedding-350M by 24913 Vulkan : vulkan: link ggml-cpu when GGML_VULKAN_CHECK_RESULTS / RUN_TESTS are enabled #24444 vulkan: make mul_mm ALIGNED a spec constant #24689 vulkan: support CONV_3D… 27 arXiv — Machine Learning research 8d ago NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction arXiv:2606.24087v1 Announce Type: new Abstract: Reconstructing 
continuous speech from scalp electroencephalography (EEG) remains fundamentally challenging. EEG provides a weak, spatially diffuse, and highly variable measurement of distributed cortical activity, whereas speech is… 9 arXiv — NLP / Computation & Language research 8d ago Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English arXiv:2606.23948v1 Announce Type: new Abstract: Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is… 6 arXiv — NLP / Computation & Language research 8d ago Automatic Part-of-Speech Tagging of Arabic-English Dictionary Senses through WordNet arXiv:2606.24359v1 Announce Type: new Abstract: This paper proposed an algorithm for part-of-speech (POS) tagging senses of a bilingual dictionary. The algorithm is applied on the Al-Mawrid Arabic-English dictionary. The tagging task is accomplished by transferring the POS tags… 21 arXiv — NLP / Computation & Language research 8d ago Measuring User's Mental Models of Speech Translation in Human-AI Collaboration arXiv:2606.24644v1 Announce Type: new Abstract: Millions of people use machine translation (MT) tools daily, yet little is known about their perception of what systems can and cannot do. This paper studies users' mental models of speech translation systems through a new… 13 arXiv — NLP / Computation & Language research 8d ago CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation arXiv:2606.24714v1 Announce Type: new Abstract: Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. 
These forms are frequent in real listening… 33 arXiv — NLP / Computation & Language research 8d ago L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models arXiv:2606.24825v1 Announce Type: new Abstract: Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most… 16 arXiv — NLP / Computation & Language research 8d ago Progressive Alignment Objectives for Aligner-Encoder based ASR arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without… 23 arXiv — NLP / Computation & Language research 8d ago ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained… 15 Hugging Face official-blog 8d ago Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World Published June 24, 2026. Daniel Gert Nielsen, daniel-treble, treble-technologies; Shivam Saini, whojavumusic, treble-technologies; Alessia Milo, alessia-treble… 11 r/LocalLLaMA community 8d ago CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. 
Every audio output scored with UTMOS (utmos22_strong) so quality isn't… 19 llama.cpp releases dev-tools 8d ago b9768 model: Granite Speech Plus (#24818) feat: Add conversion support for Granite Speech Plus Branch: GraniteSpeechPlus AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart ghart@us.ibm.com feat: Extend granite_speech to support plus multi-layer concatenation… 27 r/MachineLearning community 9d ago Recommendations for speech annotation tools [D] I'm looking for human-in-the-loop platforms that allow you to automatically transcribe audio followed by manually fixing the transcriptions and fine tuning the model. Is there a local (not an online service) installable platform for doing this? submitted by … 11 r/MachineLearning community 10d ago Best current methods for finetuning whisper on domain specific vocabulary? [P] Hey everyone, I’m wondering whether there are any newer or more effective methods for fine tuning whisper on domain specific speech. I’m working on a project where the model needs to reliably detect certain specific words and technical terms. The vocabulary and context are… 4 Hacker News — AI on Front Page community 12d ago A new bill takes aim at government pressure to silence lawful online speech Article URL: https://www.eff.org/deeplinks/2026/06/new-bill-takes-aim-government-pressure-silence-lawful-online-speech Comments URL: https://news.ycombinator.com/item?id=48600950 Points: 205 # Comments: 111 27 r/LocalLLaMA community 12d ago How do you guys setup search with your AI models? Been selfhosting my models for a while and I'd really like to integrate Gemma 4 12B as a simple voice assistant with search capabilities. I've tried using openwebui but the search is kind of broken with DDG and I really don't want to use API keys from Brave or Google etc. 
So… 25 r/LocalLLaMA community 12d ago Watching a local AI voice assistant get dumber (A 9B to 0.8B agent experiment on my RTX 5060 Ti) I wanted to find the exact floor for running an intelligent, local voice assistant agent on consumer hardware. I kept the environment, tools, and prompts identical, I stepped the model sizes down through Qwen 3.5 9B, 4B, 2B, and 0.8B to see how agentic reasoning degrades. The… 12 Hugging Face Daily Papers research 12d ago Duration Aware Scheduling for ASR Serving Under Workload Drift Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by… 26 arXiv — Machine Learning research 13d ago IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for… 35 arXiv — NLP / Computation & Language research 13d ago Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling arXiv:2606.19354v1 Announce Type: new Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. 
A central component of TTS is the… 5 arXiv — NLP / Computation & Language research 13d ago A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization arXiv:2606.19591v1 Announce Type: new Abstract: In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to… 9 arXiv — NLP / Computation & Language research 13d ago Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal arXiv:2606.19910v1 Announce Type: new Abstract: Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised… 31 arXiv — NLP / Computation & Language research 13d ago ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion arXiv:2606.20179v1 Announce Type: new Abstract: Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial… 21 arXiv — NLP / Computation & Language research 13d ago CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges arXiv:2606.20369v1 Announce Type: new Abstract: Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. 
While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats,… 5 arXiv — NLP / Computation & Language research 13d ago Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations arXiv:2606.19951v1 Announce Type: cross Abstract: Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via… 34 arXiv — NLP / Computation & Language research 13d ago Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning arXiv:2606.19996v1 Announce Type: cross Abstract: Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset… 25 arXiv — NLP / Computation & Language research 13d ago PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors arXiv:2606.20137v1 Announce Type: cross Abstract: Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA),… 35 arXiv — NLP / Computation & Language research 13d ago Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual… 7 Page 2 of 8