Tag

Voice

391 articles archived under #voice

arXiv — NLP / Computation & Language research 1mo ago

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv:2605.23975v1 Announce Type: new Abstract: Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission,…

14
arXiv — NLP / Computation & Language research 1mo ago

End-to-End Intracortical Speech Decoding from Neural Activity

arXiv:2605.24313v1 Announce Type: new Abstract: Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate…

38
arXiv — NLP / Computation & Language research 1mo ago

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

arXiv:2605.24451v1 Announce Type: new Abstract: Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for…

24
arXiv — NLP / Computation & Language research 1mo ago

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

arXiv:2605.25404v1 Announce Type: new Abstract: Cascaded Automatic Speech Recognition -- Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However,…

28
r/MachineLearning community 1mo ago

Best architecture for seamless Bilingual TTS? (Azure / English + Korean) [D]

Hi guys, I’m building a language learning app (React Native/Expo frontend, Python backend) and I’ve hit a frustrating wall with Text-to-Speech. I need the app to read sentences that mix English instructions and Korean examples (e.g., "To say hello, we use the phrase 안녕하세요.").…

20
arXiv — Machine Learning research 1mo ago

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

arXiv:2605.23235v1 Announce Type: new Abstract: Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language…

29
arXiv — NLP / Computation & Language research 1mo ago

A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

arXiv:2605.22828v1 Announce Type: new Abstract: This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo…

12
arXiv — NLP / Computation & Language research 1mo ago

AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

arXiv:2605.23325v1 Announce Type: new Abstract: Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied,…

36
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Gaslighting Attacks Against Speech Large Language Models

arXiv:2509.19858v2 Announce Type: replace Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied…

18
Simon Willison community 1mo ago

Quoting Armin Ronacher

The most frustrating failure mode right now is that people submit issues that are not in their own voice. They contain an observed problem somewhere, but it has been thrown into a clanker and the clanker reworded it and made a huge mess of it. Typically, it was prompted so badly…

18
r/LocalLLaMA community 1mo ago

TTS Benchmark Comparison (all known TTS up until May 2026)

I was tired of not having a proper TTS related benchmark that I can use and test for personal projects, so I had to make one. Hopefully this helps those looking for running local TTS tools. Has Windows and Mac results already. Linux will be tested shortly (have a 5900XT and 3090…

23
TechCrunch — AI news-outlet 1mo ago

AI is being used to resurrect the voices of dead pilots

People used AI on a spectrogram image of cockpit recordings to reconstruct them, forcing the NTSB to temporarily block access to its docket system.

11
r/LocalLLaMA community 1mo ago

I fine-tuned Cohere Transcribe to support diarization and timestamps

Hi I'll keep it short: Cohere-transcribe is currently the best open source speech to text model (and possibly even better than other proprietary models). BUT it doesn't support diarization (speaker identification) and timestamps, even though there are tokens for it in the…

36
Ars Technica — AI news-outlet 1mo ago

US scrambles to stop Internet users re-creating dead pilots’ voices

Workaround flouts law that bans NTSB disclosures of cockpit audio recordings.

13
Hacker News — AI on Front Page community 1mo ago

Steve Wozniak cheered after telling students they have AI – actual intelligence

Article URL: https://www.businessinsider.com/steve-wozniak-apple-ai-graduation-speech-2026-5 Comments URL: https://news.ycombinator.com/item?id=48233563 Points: 243 # Comments: 197

7
arXiv — NLP / Computation & Language research 1mo ago

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

arXiv:2605.22170v1 Announce Type: new Abstract: In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. The question then emerges about how model-internal mechanisms are similar and different when operating in…

9
arXiv — NLP / Computation & Language research 1mo ago

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

arXiv:2605.22435v1 Announce Type: new Abstract: Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work…

31
arXiv — NLP / Computation & Language research 1mo ago

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

arXiv:2605.22650v1 Announce Type: new Abstract: As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and…

5
r/LocalLLaMA community 1mo ago

Best solution to generate reports locally with graphs, charts? Beginner question.

So on a local lm like ollama, or lm studio etc. you can run questions and prompts. But it’s a text response and I am unable to have it generate pdfs or report files graphs. Such as a pie chart on my invoices. Or create a report for me on statistics. When I run kimi, or Claude…

11
Hugging Face Daily Papers research 1mo ago

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Abstract: Mega-ASR framework improves robustness in real-world speech recognition through compound-data construction and progressive acoustic-to-semantic optimization techniques. AI-generated summary: Despite rapid advances in automatic speech recognition (ASR) and large…

29
arXiv — NLP / Computation & Language research 1mo ago

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

arXiv:2605.20356v1 Announce Type: new Abstract: Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how…

4
arXiv — NLP / Computation & Language research 1mo ago

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

arXiv:2605.20712v1 Announce Type: new Abstract: Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error…

16
arXiv — NLP / Computation & Language research 1mo ago

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

arXiv:2605.20920v1 Announce Type: new Abstract: Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment…

27
arXiv — NLP / Computation & Language research 1mo ago

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

arXiv:2605.20946v1 Announce Type: new Abstract: The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during…

18
Hacker News — AI on Front Page community 1mo ago

College students drown out AI-praising commencement speeches with boos

Article URL: https://www.tomshardware.com/tech-industry/artificial-intelligence/college-students-drown-out-ai-praising-commencement-speeches-with-boos-deal-with-it-one-speaker-fires-back-as-students-heckle-positive-pitches-for-ais-role Comments URL:…

21
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv:2605.19069v1 Announce Type: new Abstract: Code-switching -- the natural alternation between two languages within a single utterance -- represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR…

31
arXiv — NLP / Computation & Language research 1mo ago

FormalASR: End-to-End Spoken Chinese to Formal Text

arXiv:2605.19266v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented…

29
arXiv — NLP / Computation & Language research 1mo ago

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

arXiv:2605.19711v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error…

21
arXiv — NLP / Computation & Language research 1mo ago

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

arXiv:2605.19833v1 Announce Type: cross Abstract: Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic…

37
TechCrunch — AI news-outlet 1mo ago

You can now talk to your Gmail inbox, as seen at Google IO 2026

Google expands Gmail’s AI Inbox with conversational voice search, letting users ask Gemini to find buried email details.

15
r/LocalLLaMA community 1mo ago

Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs

For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on? For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 tg and >800 pp. Qwen3.5:9B works really fast, but I'm experimenting with higher intelligence. Offloaded…

14
TechCrunch — AI news-outlet 1mo ago

Google adds voice-based prompting to Docs and Keep

Google is letting users create drafts, take notes, and search for email by voice with the new Workspace update.

16
TechCrunch — AI news-outlet 1mo ago

Google’s AI now lets you talk to your Gmail inbox

Google expands Gmail’s AI Inbox with conversational voice search, letting users ask Gemini to find buried email details.

21
r/LocalLLaMA community 1mo ago

Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates

Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com (the chatbot SaaS, name collision, sorry). Why this exists: I wanted local-only dictation and meeting transcription, because audio shouldn't have to leave the machine just to become…

13
r/LocalLLaMA community 1mo ago

Audio upscaling, cleanup, or improvement models?

I never see this type of model talked about. Are there many open models in the category? I do a lot of audio cleanup and end up using auphonic but would like to be using a local model. Edit: e.g., like voice recovery, reverb removal, auto-EQ type stuff…

5
arXiv — Machine Learning research 1mo ago

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

arXiv:2605.16545v1 Announce Type: new Abstract: After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems…

4
arXiv — NLP / Computation & Language research 1mo ago

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

arXiv:2605.16896v1 Announce Type: new Abstract: Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a…

20
arXiv — NLP / Computation & Language research 1mo ago

LLMs for automatic annotation of Mandarin narrative transcripts

arXiv:2605.17205v1 Announce Type: new Abstract: Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown…

32
arXiv — NLP / Computation & Language research 1mo ago

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

arXiv:2605.17443v1 Announce Type: new Abstract: We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our…

29
arXiv — NLP / Computation & Language research 1mo ago

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

arXiv:2605.17652v1 Announce Type: new Abstract: There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows…

10
arXiv — NLP / Computation & Language research 1mo ago

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

arXiv:2605.17710v1 Announce Type: new Abstract: Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present…

14
arXiv — NLP / Computation & Language research 1mo ago

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

arXiv:2605.17860v1 Announce Type: new Abstract: While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented,…

36
arXiv — NLP / Computation & Language research 1mo ago

Bridging the Gap: Converting Read Text to Conversational Dialogue

arXiv:2605.18001v1 Announce Type: new Abstract: In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing…

20
r/MachineLearning community 1mo ago

Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]

Hey everyone, I’m building a backend that analyzes long YouTube videos using an LLM. Currently, my flow is a slow waterfall: Download full audio -> Whisper -> LLM -> Return results . For a 30-minute video, the user waits forever. I want to pipeline this for real-time SSE…

5
r/LocalLLaMA community 1mo ago

21 GPU's benchmarked running a small TTS model (vram peak: 5GB)

I rented different GPUs on vast.ai for a few minutes each to benchmark a small TTS model, OmniVoice, with a peak VRAM usage of about 5 GB. I wanted to see how various mostly consumer GPUs would stack up against my own RTX 3090. This is by no means an extensive or scientific…

16
MIT Technology Review — AI news-outlet 1mo ago

Inside Anduril and Meta’s quest to make smart glasses for warfare

The defense-tech company Anduril has shared new details about the augmented-reality headset for the military it’s prototyping with Meta, including a vision for ordering drone strikes via eye-tracking and voice commands. Quay Barnett, who leads the efforts as a vice president at…

6
r/LocalLLaMA community 1mo ago

Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU

Wanted a real head to head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in. Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS…

35
Hacker News — AI on Front Page community 1mo ago

Eric Schmidt speech about AI booed during graduation

Article URL: https://www.nbcnews.com/tech/tech-news/former-google-ceo-booed-graduation-speech-ai-rcna345585 Comments URL: https://news.ycombinator.com/item?id=48177785 Points: 242 # Comments: 218

36
arXiv — NLP / Computation & Language research 1mo ago

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

arXiv:2605.15886v1 Announce Type: new Abstract: This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics…

9
arXiv — NLP / Computation & Language research 1mo ago

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

arXiv:2605.16026v1 Announce Type: new Abstract: Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language…

29
