Tag

Voice

390 articles archived under #voice · RSS

arXiv — NLP / Computation & Language research 20d ago

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

arXiv:2606.13464v1 Announce Type: new Abstract: Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires…

11
arXiv — NLP / Computation & Language research 20d ago

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech…

30
arXiv — NLP / Computation & Language research 20d ago

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

arXiv:2606.13630v1 Announce Type: new Abstract: The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for…

22
arXiv — NLP / Computation & Language research 20d ago

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

arXiv:2606.13544v1 Announce Type: cross Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice…

35
r/LocalLLaMA community 21d ago

How I implemented ASR bias for voice transcription models [Open Source]

I've been spending the last couple of weeks building a Wispr Flow clone as an open source project. For context, it is a voice dictation app that lets you type faster, by speaking instead of actually typing. I spent the first week building the basic STT capabilities. One of the…

29
r/LocalLLaMA community 21d ago

Infinite Music Glitch on my Arduino with Magenta Realtime 2

I built a local voice AI realtime music setup where my ESP32 microcontroller talks to my MacBook over WebSockets. The microcontroller is just a tiny Arduino-based device with a mic and speaker, and the MacBook M4 Pro runs Magenta Realtime 2 locally and streams the audio back to…

38
Smol AI News news-outlet 21d ago

not much happened today

**Anthropic's Fable/Mythos export-control crisis** dominates AI news, highlighting the intersection of **national security** and frontier model access. Technical voices like **François Chollet** criticize opaque regulatory actions and advocate for **standardized benchmarks for…

6
arXiv — NLP / Computation & Language research 21d ago

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

arXiv:2606.11219v1 Announce Type: new Abstract: Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, Text-to-Audio Retrieval, Captioning, and Question-Answering accuracy remains…

32
arXiv — NLP / Computation & Language research 21d ago

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

arXiv:2606.11386v1 Announce Type: new Abstract: Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains…

28
arXiv — NLP / Computation & Language research 21d ago

Pretrained self-supervised speech models can recognize unseen consonants

arXiv:2606.11542v1 Announce Type: new Abstract: Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource…

17
arXiv — NLP / Computation & Language research 21d ago

Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

arXiv:2606.11639v1 Announce Type: new Abstract: The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies…

18
arXiv — NLP / Computation & Language research 21d ago

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

arXiv:2606.11681v1 Announce Type: new Abstract: We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the…

15
arXiv — NLP / Computation & Language research 21d ago

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

arXiv:2606.11197v1 Announce Type: cross Abstract: Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has…

19
arXiv — NLP / Computation & Language research 21d ago

Massive Open-Vocabulary Keyword Spotting

arXiv:2606.11279v1 Announce Type: cross Abstract: Automatic speech recognition systems have been shown to under-perform when it comes to transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with…

36
arXiv — NLP / Computation & Language research 21d ago

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

arXiv:2606.11429v1 Announce Type: cross Abstract: Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end…

29
r/LocalLLaMA community 21d ago

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the…

12
Hugging Face Daily Papers research 22d ago

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Abstract: Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Language models increasingly serve as the…

19
r/LocalLLaMA community 22d ago

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single…

31
arXiv — Machine Learning research 22d ago

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

arXiv:2606.09962v1 Announce Type: new Abstract: Continuous diffusion for categorical data is a framework belonging to the diffusion family and aiming at generating discrete data. The scientific interest to such models has been constantly increasing these days because researchers…

14
arXiv — NLP / Computation & Language research 22d ago

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

arXiv:2606.10029v1 Announce Type: cross Abstract: Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train…

22
arXiv — NLP / Computation & Language research 22d ago

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

arXiv:2606.10581v1 Announce Type: new Abstract: Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs)…

11
arXiv — NLP / Computation & Language research 22d ago

Speaker Group Encoding in Self-supervised Speech Recognition Models

arXiv:2606.10654v1 Announce Type: new Abstract: We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech…

10
arXiv — NLP / Computation & Language research 22d ago

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

arXiv:2606.10675v1 Announce Type: new Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech…

34
arXiv — NLP / Computation & Language research 22d ago

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

arXiv:2606.11167v1 Announce Type: new Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level…

23
arXiv — NLP / Computation & Language research 22d ago

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability…

29
arXiv — NLP / Computation & Language research 22d ago

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

arXiv:2606.10439v1 Announce Type: cross Abstract: The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work…

21
arXiv — NLP / Computation & Language research 22d ago

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

arXiv:2606.10475v1 Announce Type: cross Abstract: Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the…

18
r/LocalLLaMA community 22d ago

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per-call forever. The method (DV-DPO): Run a 3-voice council on each question and produce a synthesis. Cross-examine: losing voices challenge the synthesis. If synthesis gets…

35
Vercel — AI dev-tools 22d ago

Threshold billing is now enabled for Pro teams

Threshold billing now sends Pro teams a partial invoice mid-cycle once on-demand usage reaches a threshold, instead of holding all charges until the end of the billing period. Partial invoices and the end-of-cycle invoice add up to your total usage, so the same usage is never…

15
r/MachineLearning community 22d ago

iOS 27 Siri is using WaveRNN and FastSpeech2 [D]

Found in the iOS Simulator's files. Both of them are in espresso format. There's also another compiled CoreML model for concert ranking which, based on its contents, looks to be a simple logistic regression. See…

38
TechCrunch — AI news-outlet 22d ago

Hey Siri, here’s what I actually want from AI

I'm desperate for a personal AI assistant, but do I really want to become the kind of person who can't function without the friendly robot voice in my phone?

4
Hugging Face official-blog 22d ago

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Enterprise article, published June 9, 2026. Authors: Shama Gupta (ServiceNow-AI), Lindsay Brin (ServiceNow-AI), Fanny Riols (ServiceNow-AI)…

11
Ars Technica — AI news-outlet 22d ago

Google announces Gemini 3.5 Live Translate for instant voice-to-voice translation

Voice translations preserve speaker's tone, pacing, pitch—with SynthID watermarks for security.

16
llama.cpp releases dev-tools 22d ago

b9585

graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (#24357). llama-graph: apply embedding scale when deepstack is not used. nits: remove non-existent hunyuan-vl from the tests. apply suggestion from @gabe-l-hart. Co-authored-by: Xuan…

25
r/MachineLearning community 22d ago

What will be the next breakthrough in ASR? [D]

Hey All, I am currently working on ASR models, and I have gathered some recent literature. From my literature search, it seems like the ASR models are getting more and more powerful due to two main things. Because pseudo-labelled data is growing, supervised models are rising…

35
r/LocalLLaMA community 22d ago

Text-to-Speech (TTS) Benchmark Revamped with Objective Standards and Blind Voting (46 models and counting)

Thank you to everyone who contributed to my previous post, providing feedback and various models to add, and questioning the rating system. You can now participate in a live blind voting to create a proper ELO for all the models that are added. Each new model that we add will…

23
The Information — AI news-outlet 22d ago

Broadcom to Help Finance Anthropic, OpenAI Chip Deals With Apollo, Blackstone

Broadcom said Tuesday that it is launching a new fund—backed by Apollo and Blackstone—to help finance more than 20 gigawatts of AI data centers through 2028 using chips designed by Broadcom, including projects tied to Anthropic and OpenAI. Apollo will lead an initial $35 billion…

19
Google DeepMind official-blog 22d ago

Fluid, natural voice translation with Gemini 3.5 Live Translate

Gemini 3.5 Live Translate brings near real-time, natural speech translation to Google AI Studio, Google Translate and Google Meet.

32
NVIDIA Developer Blog official-blog 22d ago

Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine,...

9
r/LocalLLaMA community 23d ago

PSA: Throttle GPU power limits, with minor performance deficits

I just feel i need to post this here again so more people see: Test around with throttling the power limits of your GPUs, you will often find that you can save tons of power with only minor performance deficits. On my dual Radeon VII setup, i went from 250 to 100 watts per card,…

11
Hugging Face Daily Papers research 23d ago

Liberating LLM Capabilities in Full-Duplex Speech Models

Abstract: A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating superior performance in full-duplex conversational tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Speech-based large language…

21
Hugging Face Daily Papers research 23d ago

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Abstract Research demonstrates that hallucinations in Whisper ASR can be detected and reduced using internal representations from audio encoder activations and Sparse AutoEncoder latents, achieving significant hallucination rate reduction with minimal speech transcription…

20
OpenAI official-blog 23d ago

What Codex unlocks for Notion

How Notion uses Codex to one-shot specs, build AI Voice Input for the web, and multiply engineering power across small teams.

26
arXiv — Machine Learning research 23d ago

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

arXiv:2606.07610v1 Announce Type: new Abstract: State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful…

6
r/LocalLLaMA community 23d ago

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The…

14
The Information — AI news-outlet 23d ago

Apple Tries for Another Siri Reboot

Apple launched a much anticipated new version of its Siri voice assistant at the start of its annual developer conference on Monday, which users will be able to access through a new Siri app. The refreshed voice assistant, now called Siri AI, which uses Google’s Gemini models,…

31
Ars Technica — AI news-outlet 23d ago

Say hi to "Siri AI"—Apple announces new, more "conversational" voice assistant

New features coming this fall alongside two-tiered, Google-powered AI model overhaul.

6
TechCrunch — AI news-outlet 23d ago

Apple’s long-awaited AI Siri overhaul is finally here

The idea behind the new "Siri AI" is to turn the assistant from a voice controlled assistant into an AI companion that can do a lot more.

30
Hacker News — AI on Front Page community 23d ago

Massachusetts bans sale of precise location data in new privacy rights bill

Article URL: https://techcrunch.com/2026/06/08/massachusetts-votes-to-pass-new-privacy-rights-bill-that-bans-sale-of-precise-location-data/ Comments URL: https://news.ycombinator.com/item?id=48448012 Points: 214 # Comments: 34

29
Hugging Face Daily Papers research 24d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
