Tag

Music

190 articles archived under #music

arXiv — Machine Learning research 16d ago

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and…

28
arXiv — NLP / Computation & Language research 16d ago

TMASC: Transmasculine Attitude and Speech Corpus

arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the…

25
Hugging Face Daily Papers research 16d ago

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Abstract A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce…

5
r/MachineLearning community 16d ago

Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D]

I'm trying to understand where people doing sensor-based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real-world data in the first…

6
r/LocalLLaMA community 16d ago

What do you guys think about Unsloth Studio?

As a person who has gone through more AI frontends than one goes through socks, I have really appreciated the Unsloth frontend. It's anything I could ever need and it supports Diffusion Gemma! It has easy options to enable tensor parallelism and much more. Have you guys tried it…

33
r/LocalLLaMA community 16d ago

I think we need a /LocalHarnessLLM or something ...

LM Studio Hermes Qwen Code Odysseus Open Claw Open Code Claude Code (and then IDEs w/ agentic capabilities) Continue Rider VS Code And a dozen others I'm sure ... Would love a place to discuss these? If not a new subreddit, a new discord section in localllama discord? I've made…

24
arXiv — Machine Learning research 17d ago

Beyond task performance: Decoding bioacoustic embeddings with speech features

arXiv:2606.14662v1 Announce Type: new Abstract: Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species…

6
arXiv — NLP / Computation & Language research 17d ago

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

arXiv:2606.13993v1 Announce Type: new Abstract: A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules.…

34
arXiv — NLP / Computation & Language research 17d ago

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

arXiv:2606.14694v1 Announce Type: new Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and…

4
arXiv — NLP / Computation & Language research 17d ago

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

arXiv:2606.14141v1 Announce Type: cross Abstract: Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source…

12
arXiv — NLP / Computation & Language research 17d ago

A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

arXiv:2606.14230v1 Announce Type: cross Abstract: Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly…

19
Simon Willison community 19d ago

OpenAI WebRTC Audio Session, now with document context

I built the first version of this tool in December 2024 to try out the then-new OpenAI WebRTC API for interacting with their realtime audio models. Last month OpenAI introduced a brand new model to that API called…

9
Hugging Face Daily Papers research 20d ago

PianoKontext: Expressive Performance Rendering from Deadpan Context

Abstract PianoKontext generates variable-length piano performances by aligning MIDI scores with audio in latent space using DTW and DiT blocks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Expressive performance rendering (EPR) aims to generate realistic performances constrained…

12
r/LocalLLaMA community 20d ago

Why hasn't any mainstream game integrated LLMs into NPCs yet?

Tech demos exist but nothing's actually shipped in a real game. Is it a latency problem or are game studios just not interested?

29
arXiv — NLP / Computation & Language research 20d ago

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

arXiv:2606.13322v1 Announce Type: new Abstract: We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines…

13
arXiv — NLP / Computation & Language research 20d ago

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech…

30
r/LocalLLaMA community 21d ago

Infinite Music Glitch on my Arduino with Magenta Realtime 2

I built a local voice AI realtime music setup where my ESP32 microcontroller talks to my MacBook over WebSockets. The microcontroller is just a tiny Arduino-based device with a mic and speaker, and the MacBook M4 Pro runs Magenta Realtime 2 locally and streams the audio back to…

38
arXiv — NLP / Computation & Language research 21d ago

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

arXiv:2606.11219v1 Announce Type: new Abstract: Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, Text-to-Audio Retrieval, Captioning, and Question-Answering accuracy remains…

32
arXiv — NLP / Computation & Language research 21d ago

Pretrained self-supervised speech models can recognize unseen consonants

arXiv:2606.11542v1 Announce Type: new Abstract: Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource…

17
r/LocalLLaMA community 21d ago

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the…

12
r/LocalLLaMA community 22d ago

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single…

31
arXiv — NLP / Computation & Language research 22d ago

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability…

29
arXiv — NLP / Computation & Language research 22d ago

CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

arXiv:2606.09846v1 Announce Type: cross Abstract: Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated…

6
arXiv — NLP / Computation & Language research 22d ago

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

arXiv:2606.10147v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the…

30
Google DeepMind official-blog 22d ago

Fluid, natural voice translation with Gemini 3.5 Live Translate

Gemini 3.5 Live Translate brings near real-time, natural speech translation to Google AI Studio, Google Translate and Google Meet.

32
Hugging Face Daily Papers research 22d ago

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Abstract Research demonstrates that hallucinations in Whisper ASR can be detected and reduced using internal representations from audio encoder activations and Sparse AutoEncoder latents, achieving significant hallucination rate reduction with minimal speech transcription…

20
Hugging Face Daily Papers research 23d ago

EMMA: Extracting Multiple physical parameters from Multimodal Data

Abstract EMMA is a physics-informed multimodal framework that directly recovers dynamical parameters from raw video, audio, and image data using a Liquid Time-Constant network and physics-constrained loss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce EMMA, a…

33
llama.cpp releases dev-tools 23d ago

b9555

metal : fix im2col 1D case (audio models) ( #24220 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64…

29
Hugging Face Daily Papers research 24d ago

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

Abstract Confidence-based loss weighting via entropy-derived log-barrier enables improved audio generation through adaptive gradient scaling in supervised diffusion training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Confidence-based loss weighting is usually avoided in…

36
Hugging Face Daily Papers research 24d ago

MMAE: A Massive Multitask Audio Editing Benchmark

Abstract MMAE presents a comprehensive benchmark for instruction-based audio editing across multiple modalities and complexity levels, revealing significant gaps in current model capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce MMAE, a Massive Multitask…

24
arXiv — Machine Learning research 24d ago

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

arXiv:2606.07387v1 Announce Type: new Abstract: State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose…

15
arXiv — NLP / Computation & Language research 24d ago

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

arXiv:2606.06743v1 Announce Type: cross Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main…

21
arXiv — NLP / Computation & Language research 24d ago

MMAE: A Massive Multitask Audio Editing Benchmark

arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,…

8
arXiv — NLP / Computation & Language research 24d ago

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question…

14
arXiv — NLP / Computation & Language research 24d ago

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

arXiv:2606.07356v1 Announce Type: cross Abstract: Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free…

25
r/LocalLLaMA community 24d ago

Dockerized Nemotron 3.5 ASR — Switched from Parakeet, better multilingual support + streaming (4.5x realtime speed on cpu)

I was originally using Parakeet for my speech recognition pipeline but decided to give Nemotron 3.5 a shot. After testing it on some multilingual audio clips, it's been working great so far. What sold me: - Better language support (40+ locales from one model) - Native streaming…

17
r/LocalLLaMA community 25d ago

Gemma4 12B - Experiences?

Anyone check out the new Gemma4 12B that dropped 3 days ago? Integrated vision and audio recognition, no mmpro needed plus tool use. Q4 quant is like 8gb RAM. Crazy fast and great quality for its size. No, it's not as good as a 27B or 31B. But it's damn close. Curious what…

24
r/LocalLLaMA community 25d ago

Best Coding Harness for Qwen3.6 35B?

I've been happily using GitHub Copilot for 7-8 months, primarily in Visual Studio and VS Code, mostly with the built-in flagship models and have felt like the output is worth the cost. Lately I've been playing with a lot of different local LLM models and decided to try using…

32
r/LocalLLaMA community 26d ago

I just realized how good MoE models are for consumer hardware

I've been tinkering around with LLMs for a while now, started with LM Studio like probably all of us and wanted to go into a headless self-hosted model so that I can use my MacBook and still use my AI models. I've been using Qwen 3.6 (and 3.5) 27B on my main computer which has a…

7
r/MachineLearning community 26d ago

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in. Setup: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono. Results: Inference path RTF Peak Memory CPU…

26
r/LocalLLaMA community 27d ago

Benchmark & Reality Check on Gemma 4 12B: Great model, but your local settings are probably breaking it (Fix inside)

I completed a Python bug hunting benchmark with Gemma 4 12B. I used the Unsloth Dynamic Q5 GGUF model. The model has good capabilities. Default settings in LM Studio disable the reasoning. Fix the LM Studio reasoning configuration. LM Studio looks for Qwen tokens. Gemma 4 uses…

30
Hugging Face Daily Papers research 27d ago

Multimodal Music Recommendation System using LLMs

Abstract A multimodal framework for session-based music recommendation integrates audio, lyric, and semantic signals with LLM-based sequential reasoning to improve recommendation accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Music recommendation systems typically treat…

16
arXiv — NLP / Computation & Language research 27d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four…

5
arXiv — NLP / Computation & Language research 27d ago

Forgive or forget: Understanding the context of hate in audio retrieval systems

arXiv:2606.05857v1 Announce Type: new Abstract: Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing…

27
arXiv — NLP / Computation & Language research 27d ago

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.…

21
r/LocalLLaMA community 27d ago

Higgs Audio v3 TTS 4B. Built for voice chat. Support 100 languages and inline control.


31
llama.cpp releases dev-tools 28d ago

b9503

fix(mtmd): handle Gemma 4 audio projector embedding size ( #24091 ) mtmd: handle Gemma 4 audio projector embedding size rm projection_dim from clip_n_mmproj_embd Co-authored-by: Xuan Son Nguyen son@huggingface.co macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64,…

28
arXiv — NLP / Computation & Language research 28d ago

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

arXiv:2606.04205v1 Announce Type: cross Abstract: The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available…

6
arXiv — NLP / Computation & Language research 28d ago

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

arXiv:2606.04418v1 Announce Type: cross Abstract: Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency,…

6
Hugging Face Daily Papers research 28d ago

Audio Interaction Model

Abstract A unified streaming audio model is developed that combines offline task execution with real-time audio instruction following through an end-to-end framework supporting multiple audio interaction capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Audio is an…

20
