Tag

Voice

391 articles archived under #voice

arXiv — NLP / Computation & Language research 1mo ago

Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed?

arXiv:2606.01298v1 Announce Type: new Abstract: The spread of hate speech has become increasingly harmful in modern digital environments, particularly on social networking platforms. While recent advances have shown promising results in automatic hate speech detection, a key…

34
r/MachineLearning community 1mo ago

Full duplex vs half duplex - the spectrum of AI voice models [D]

It seems that there are two ways to build voice AI: Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today. Full-duplex: two channels, both sides can talk at…

32
r/MachineLearning community 1mo ago

Real-time multilingual ASR using rolling buffers and monolingual models [P]

I built a routing-based approach to lightweight real-time multilingual ASR as part of my research at Gladia. The core problem was how multilingual models that accurately handle mid-conversation language switches are often too big for most local hardware and have poor accuracy.…

36
Vercel — AI dev-tools 1mo ago

Chat SDK adds AgentPhone support

Chat SDK now supports AgentPhone with the new vendor-official adapter. Give your bot its own phone number so it can handle voice calls and text messages using the same handlers you already write. When a call ends, the transcript is delivered as a message, allowing your bot to…

14
arXiv — NLP / Computation & Language research 1mo ago

Your Multimodal Speech Model Says I Have a Face for Radio

arXiv:2605.30472v1 Announce Type: new Abstract: As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to…

35
arXiv — NLP / Computation & Language research 1mo ago

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

arXiv:2605.30608v1 Announce Type: new Abstract: Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is…

4
arXiv — NLP / Computation & Language research 1mo ago

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

arXiv:2605.31432v1 Announce Type: new Abstract: Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based…

37
arXiv — NLP / Computation & Language research 1mo ago

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

arXiv:2605.31469v1 Announce Type: new Abstract: Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint…

17
Hugging Face Daily Papers research 1mo ago

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Abstract Swanbench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations. AI-generated summary Recent advances in speech generation have enabled…

5
Hugging Face Daily Papers research 1mo ago

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Abstract A zero-shot text-to-speech system called SwanVoice is presented that addresses expressive long-form multi-speaker dialogue synthesis by combining VAE, flow-matching DiT, and diffusion post-training techniques. AI-generated summary Zero-shot text-to-speech (TTS) has…

6
r/MachineLearning community 1mo ago

Arabic ASR model struggling to converge during training [D]

i'm trying to train an ASR model using the LibriSpeech recipe from SpeechBrain (without the language model) on a 100-hour dataset of dialectal Arabic speech. the model architecture uses a Conformer-small encoder and a Transformer decoder, with a total of around 13M parameters.…

23
r/LocalLLaMA community 1mo ago

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal). The goal was to match NeMo…

30
r/LocalLLaMA community 1mo ago

13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. coder3101's variant achieves 96% ASR with capability…

17
TechCrunch — AI news-outlet 1mo ago

SoftBank says it will invest up to €75 billion to build French data centers

The goal, the firm said, is to develop and operate up to 5 gigawatts of additional data center capacity.

30
The Information — AI news-outlet 1mo ago

Softbank to Invest Up To 75 Billion Euros on AI Data Centers in France

SoftBank Group announced a commitment to develop and operate five gigawatts of AI data center capacity in France, with an investment of up to 75 billion euros, or about $87.5 billion. The commitment is SoftBank’s largest AI infrastructure investment to date in Europe, the…

13
r/LocalLLaMA community 1mo ago

Whisper.cpp is underwhelming

Hi, I'm running whisper.cpp with the best model I could find (ggml-large-v3) but after about 20 min of transcription it hallucinates a sentence that it will repeat endlessly until the end. Is there something I'm missing or should I cut my files to about 20 minutes length?  …

17
r/LocalLLaMA community 1mo ago

STT -> LLM -> TTS pipeline

Hey guys, I’m trying to learn about how to better create a STT LLM TTS pipeline. My current setup is running a 3090 on Ubuntu. I use llama.cpp to run Qwen 3.6 27B Q4 with pi-agent for tool calling, and I just run everything in the terminal, I haven’t really bothered with chat…

25
r/LocalLLaMA community 1mo ago

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

Under $1000 for 32gb vram from 2023, and ~300 watts draw... and this thing is outperforming the latest pick-your-vendor $5k mini pcs from 2026. So.. next question is can I make it squeeze 150 t/s with the same q4xl on cuda 13.3 this weekend. Anyone try it yet?

13
r/LocalLLaMA community 1mo ago

Fulloch V2: 100% Local Voice Assistant for Home Assistant & Obsidian (Runs on 16GB VRAM)

Hey everyone, following up on my r/LocalLLaMA post from a while back, I have spent some time testing how far I can push my 5060ti as a personal voice assistant. The stack is Qwen3.5-9B GGUF Q5_K_M, Qwen3-1.7B ASR, and Qwen3-1.7B TTS, delivering fast, real-time responses with…

21
r/LocalLLaMA community 1mo ago

made a local voice AI for windows you can talk to in any language. open source, bring your own key

been building this on and off for a while and finally got it to a point where i'm not embarrassed to share it, so here goes. it's called Shadow AI. basically a voice-first AI companion that runs on your own windows machine. you just talk to it and it talks back, no typing…

38
r/LocalLLaMA community 1mo ago

this new Moss tts 1.5 is damn good with voice cloning

https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5 I prefer this over fish audio s2 pro because fish audio doesn't allow commercial use. Long Cat DiT 3.5 is also another good model.

38
Hugging Face Daily Papers research 1mo ago

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Abstract A novel convex optimization framework for language detection in spoken dialogue systems that achieves high accuracy with efficient training and theoretical guarantees against dialectal variations under low-resource conditions. AI-generated summary Globalization and…

21
r/LocalLLaMA community 1mo ago

We gave a Reachy Mini a real-time voice brain

We attended an event the other day and found this little guy lying on our desk, a Reachy Mini from Hugging Face. It belongs to the daughter of the event organizer. We got curious about how it worked, and an hour later we'd given it a brain. The model basically becomes Reachy. It…

19
Hugging Face Daily Papers research 1mo ago

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Abstract ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models. AI-generated summary We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals…

15
arXiv — Machine Learning research 1mo ago

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

arXiv:2605.29543v1 Announce Type: new Abstract: Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This…

10
arXiv — Machine Learning research 1mo ago

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

arXiv:2605.29659v1 Announce Type: new Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail…

4
arXiv — NLP / Computation & Language research 1mo ago

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv:2605.28833v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for…

12
arXiv — NLP / Computation & Language research 1mo ago

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

arXiv:2605.29188v1 Announce Type: new Abstract: Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement…

5
arXiv — NLP / Computation & Language research 1mo ago

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

arXiv:2605.27376v1 Announce Type: new Abstract: While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use…

34
arXiv — NLP / Computation & Language research 1mo ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

arXiv:2605.27383v1 Announce Type: new Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by…

24
arXiv — NLP / Computation & Language research 1mo ago

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

arXiv:2605.27808v1 Announce Type: new Abstract: Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech…

14
arXiv — NLP / Computation & Language research 1mo ago

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

arXiv:2605.27874v1 Announce Type: new Abstract: Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the…

13
arXiv — NLP / Computation & Language research 1mo ago

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment…

10
arXiv — NLP / Computation & Language research 1mo ago

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

arXiv:2605.28211v1 Announce Type: new Abstract: SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify…

6
arXiv — NLP / Computation & Language research 1mo ago

Why We Need Speech to Evaluate Speech Translation

arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and…

35
arXiv — NLP / Computation & Language research 1mo ago

Building Community-Centred NLP Resources for Puno Quechua

arXiv:2605.28253v1 Announce Type: new Abstract: The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for…

30
The Information — AI news-outlet 1mo ago

The Hot New Way to Communicate with AI? Whispering

If you wander into the Manhattan office of AI startup Basis on a workday, you’ll see most of its 100 or so staffers whispering quietly into gooseneck microphones at their desks. They aren’t taking phone calls or talking with other humans at all. They’re speaking softly to their…

23
r/MachineLearning community 1mo ago

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, CommonVoice,…

31
arXiv — NLP / Computation & Language research 1mo ago

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

arXiv:2605.26978v1 Announce Type: new Abstract: Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target…

13
arXiv — NLP / Computation & Language research 1mo ago

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments…

27
arXiv — NLP / Computation & Language research 1mo ago

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

arXiv:2605.27030v1 Announce Type: new Abstract: Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated…

10
arXiv — NLP / Computation & Language research 1mo ago

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv:2605.27062v1 Announce Type: new Abstract: State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented…

31
arXiv — NLP / Computation & Language research 1mo ago

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

arXiv:2605.27189v1 Announce Type: new Abstract: This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate…

7
Hacker News — AI on Front Page community 1mo ago

Uber, Lyft drivers in Massachusetts form first US ride-share union

Article URL: https://www.reuters.com/business/world-at-work/uber-lyft-drivers-massachusetts-form-first-us-ride-share-union-2026-05-26/ Comments URL: https://news.ycombinator.com/item?id=48281509 Points: 220 # Comments: 118

37
r/LocalLLaMA community 1mo ago

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For…

10
Hugging Face Daily Papers research 1mo ago

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Abstract ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts. AI-generated summary Generative priors in Image Super-Resolution (SR) often compromise…

10
r/LocalLLaMA community 1mo ago

Self-hosted STT better than Whisper Large V3 Turbo that matches AssemblyAI quality?

I’m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isn’t where I need it. I like AssemblyAI’s quality and want something self-hosted that: - Is clearly better than Whisper Large V3 Turbo - Can match or get close to AssemblyAI’s transcription quality -…

11
r/LocalLLaMA community 1mo ago

I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

I wrote about what I found in a deep dive elsewhere (which I will not mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stuff and I've seen questions about NPUs here before, which are often dismissed as…

9
arXiv — Machine Learning research 1mo ago

Hardware-Aware Federated Learning for Speech Emotion Recognition

arXiv:2605.24712v1 Announce Type: new Abstract: Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication…

16
arXiv — NLP / Computation & Language research 1mo ago

Raon-Speech Technical Report

arXiv:2605.23912v1 Announce Type: new Abstract: We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural…

11
