Tag

Music

190 articles archived under #music · RSS

r/LocalLLaMA community 28d ago

How to use audio and vision modalities in llama.cpp?

How to use audio and vision modalities in llama.cpp with Gemma4 12B it? I’m on release b9494, but when I run llama-cli it shows “modalities: text” only, and crashes if I try to add an image. submitted by /u/No-Leave-4512

20
r/LocalLLaMA community 28d ago

Best way to index full Italian Wikipedia for 100% offline RAG in LM Studio?

Hi everyone, I want to set up a 100% offline RAG system using LM Studio and the entire Italian Wikipedia (text-only, no images). My goal is to index the database once so my local LLMs can query it for up-to-date factual knowledge without internet access. Here are my PC specs:…

14
r/LocalLLaMA community 28d ago

google/gemma-4-12B · Hugging Face

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned…

29
Hugging Face Daily Papers research 29d ago

MERIT: Learning Disentangled Music Representations for Audio Similarity

Abstract: The MERIT framework learns disentangled music representations for melody, rhythm, and timbre through conditional audio generation and source-separated stems, enabling nuanced musical queries. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Current music similarity models…

21
Vercel — AI dev-tools 29d ago

Grok Imagine Video 1.5 on AI Gateway

Grok Imagine Video 1.5 from xAI is now available on AI Gateway. The model generates video from an input image with synchronized audio in a single pass. This release improves audio quality, prompt following, and photorealism. Face accuracy and character consistency are stronger…

26
r/LocalLLaMA community 29d ago

Benchmarks of 20 small LLMs on a 6GB RTX 4050

I'm looking for models that can run on my GPU and actually do something useful. I think that any small difference could be a "big" improvement, because they are all so small. So I went to the LM studio database and searched many variants from the same family, trying to select…

37
r/LocalLLaMA community 1mo ago

NVIDIA releases Cosmos 3 Omnimodal world models on HF

https://huggingface.co/nvidia/Cosmos3-Super-Text2Image Nano: 16B, Super: 64B. Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory…

7
r/LocalLLaMA community 1mo ago

Moss tts 1.5 8b Examples. It is currently the best voice cloning model for English as of June 2026

Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts voice clone tts. You can easily get better quality if you set the duration of the output voice you want, adjust the temperature, and make other changes. This was just run on default settings. It can be improved…

20
arXiv — NLP / Computation & Language research 1mo ago

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

arXiv:2606.00579v1 Announce Type: new Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed…

37
Hugging Face Daily Papers research 1mo ago

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Abstract: StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage…

8
Smol AI News news-outlet 1mo ago

not much happened today

**NVIDIA** led open-source AI model releases with **Cosmos 3**, a comprehensive omnimodal world model unifying language, image, video, audio, and action using a Mixture-of-Transformers design, and **Nemotron 3 Ultra**, a **550B** parameter open-weight model noted for high…

33
Hugging Face Daily Papers research 1mo ago

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Abstract: SwanSphere presents a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies. AI-generated summary: Real-time and accurate spatial…

25
r/LocalLLaMA community 1mo ago

Llama Studio v0.2.0

I have made an update to my llama-server WebUI based on some awesome feedback and interaction with the community. 1) JSON model config replaced by per-model shell scripts. Run from CLI, paste from unsloth, email to your buddy or post to reddit: Using real shell scripts to store…

17
r/LocalLLaMA community 1mo ago

<Think> toggle button for llama.cpp web chat for QWEN3.6

Missing a button in llama-server webchat to toggle reasoning on/off like in LM Studio? This is a snippet that runs in https://www.tampermonkey.net/ a browser…

34
r/LocalLLaMA community 1mo ago

Open source : Turning vocal imitations into sound effects. (New UX for sound generation)

Hello guys I want to introduce my new project! Have you ever needed a specific sound while making a video or a game? You know exactly what it sounds like in your head, but have no idea how to search for it. That’s why sound design meetings at game studios often turn into people…

12
r/LocalLLaMA community 1mo ago

this new Moss tts 1.5 is damn good with voice cloning

https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5 I prefer this over fish audio s2 pro because fish audio doesn't allow commercial use. Long Cat DiT 3.5 is also another good model. submitted by /u/9r4n4y

38
r/LocalLLaMA community 1mo ago

I compared all specs of the major GPUs/machines that are being used here, because bandwidth is not everything. Some of y'all need a reality check.

Hot takes: - Mac studio is overpriced Raspberry Pi that is way more inefficient than people think (together with most macs). M5 MBP is better with the "tensor" MMA, but not by much. - Spark was actually decent when it was just 3-4k. Strix is obviously much better now - 3090 are…

26
r/LocalLLaMA community 1mo ago

Unsloth Studio updated to support training with MLX on macs

The title says it all. I noticed this morning when reviewing Unsloth Studio github that training with MLX is now fully supported. Not sure when this was added but must have been within the last couple of weeks since last I checked it said "coming soon." I haven't personally…

36
Hugging Face Daily Papers research 1mo ago

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Abstract: ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models. AI-generated summary: We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals…

15
arXiv — Machine Learning research 1mo ago

Auditing Training Data in Generative Music Models via Black-Box Membership Inference

arXiv:2605.29202v1 Announce Type: new Abstract: Recent advances in text-to-music generation enable high-fidelity synthesis of structured musical audio, raising growing concerns about data provenance, consent, and training transparency. These models are typically trained on…

29
arXiv — NLP / Computation & Language research 1mo ago

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

arXiv:2605.29300v1 Announce Type: new Abstract: Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored.…

8
llama.cpp releases dev-tools 1mo ago

b9393

mtmd: fix gemma 4 audio rms norm eps (#23815) Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com macOS/iOS: macOS Apple Silicon (arm64) macOS…

34
Hugging Face Daily Papers research 1mo ago

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Abstract: OmniInteract presents a streaming benchmark for real-time omnimodal large language models that evaluates online audio-visual processing with temporal grounding and interactive response requirements. AI-generated summary: We introduce OmniInteract, a streaming benchmark…

25
Hugging Face Daily Papers research 1mo ago

Native Audio-Visual Alignment for Generation

Abstract: NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary: Joint audio-video generation aims to synthesize temporally synchronized and…

38
arXiv — NLP / Computation & Language research 1mo ago

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

arXiv:2605.27741v1 Announce Type: new Abstract: Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural…

38
arXiv — NLP / Computation & Language research 1mo ago

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment…

10
TechCrunch — AI news-outlet 1mo ago

ElevenLabs’s new music generation model can switch genres mid-track

ElevenLabs' new model will let users regenerate a section of a song without affecting the rest of the track.

29
r/LocalLLaMA community 1mo ago

I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k event dataset and some things that I learned

Howdy everyone! Quick disclosure: I work on this - it's a project my studio created called the Null Epoch. I wasn't really happy with testing my agents with the usual static benchmarks and I wanted to learn more about how models and agents handle long-horizon planning, resource…

25
r/MachineLearning community 1mo ago

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, CommonVoice,…

31
Hugging Face Daily Papers research 1mo ago

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Abstract: Gemini Embedding 2 is a multimodal embedding model that generates unified representations for video, audio, image, and text data, achieving superior performance across diverse retrieval tasks and demonstrating strong zero-shot capabilities across specialized domains.…

18
arXiv — NLP / Computation & Language research 1mo ago

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

arXiv:2605.26978v1 Announce Type: new Abstract: Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target…

13
arXiv — NLP / Computation & Language research 1mo ago

Learning When to Think While Listening in Large Audio-Language Models

arXiv:2605.27190v1 Announce Type: new Abstract: Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until…

12
Hugging Face Daily Papers research 1mo ago

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract: LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary: Audio-visual generation is rapidly advancing…

32
Hugging Face official-blog 1mo ago

Reachy Mini goes fully local

Published May 27, 2026. By Amir Mahla and Andres Marafioti. After building your Reachy Mini, you'll install the conversation app and start talking to it. Until now, you had to send your audio to a…

20
r/LocalLLaMA community 1mo ago

Llamacpp server : How do the -np and -c flags interact?

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact. The context for each parallel client appears to be equally distributed…

10
arXiv — NLP / Computation & Language research 1mo ago

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

arXiv:2605.23954v1 Announce Type: new Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement,…

36
arXiv — NLP / Computation & Language research 1mo ago

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv:2605.23975v1 Announce Type: new Abstract: Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission,…

14
arXiv — NLP / Computation & Language research 1mo ago

Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

arXiv:2605.25179v1 Announce Type: new Abstract: Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token…

12
Hugging Face Daily Papers research 1mo ago

StepAudio 2.5 Technical Report

Abstract: StepAudio 2.5 is a unified audio-language model that matches specialized systems in ASR, TTS, and real-time spoken interaction by using task-tailored reinforcement learning from human feedback to optimize shared representations across different operational modes.…

12
r/LocalLLaMA community 1mo ago

Qwen 3.6 27B MTP speed on 3080ti (getting 4.5 t/s)

Using LM Studio with a 3080ti (12GB of VRAM) and 128GB of DDR4. Model version: Qwen 3.6 27B MTP UD q4_k_xl. Is this my hardware limit? Is there any way to speed this up using the current hardware? submitted by /u/yehiaserag

35
r/LocalLLaMA community 1mo ago

qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

I have this 10-year-old Dell T5810 workstation with 32GB ddr3(?) memory, an E5-2698v3 (16 cores, 32 threads), and a GTX 1060 6GB that was used for mining back in the old days (paid itself back many times over). I managed to get the model running with LMStudio in Windows(!). My…

13
Hacker News — AI on Front Page community 1mo ago

Show HN: Audiomass – a free, open-source multitrack audio editor for the web

Article URL: https://audiomass.co/?multitrack=1 · Comments URL: https://news.ycombinator.com/item?id=48258015 · Points: 338 · Comments: 68

29
Hacker News — AI on Front Page community 1mo ago

BambuStudio has been violating PrusaSlicer AGPL license since their fork

Article URL: https://xcancel.com/josefprusa/status/2054602354851254330 · Comments URL: https://news.ycombinator.com/item?id=48245862 · Points: 249 · Comments: 96

5
r/LocalLLaMA community 1mo ago

meituan-longcat/LongCat-Video-Avatar-1.5 · Hugging Face

🚀 Model Introduction We are excited to announce the release of LongCat-Video-Avatar 1.5, an upgraded open-source framework that prioritizes extreme empirical optimization and production-readiness for audio-driven human video generation. Built upon the LongCat-Video foundation…

21
Ars Technica — AI news-outlet 1mo ago

US scrambles to stop Internet users re-creating dead pilots’ voices

Workaround flouts law that bans NTSB disclosures of cockpit audio recordings.

13
Hugging Face Daily Papers research 1mo ago

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Abstract: Audio diffusion models are adapted for interactive music generation through efficient block-wise processing and novel training paradigms that enable real-time performance on consumer hardware. AI-generated summary: Interactive streaming music generation promises the use…

11
Hugging Face Daily Papers research 1mo ago

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Abstract: LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states using feature-level supervision and temporal consistency embedding, outperforming explicit text-based chain-of-thought approaches in audio-visual reasoning…

18
r/MachineLearning community 1mo ago

Live Human Detector on Outbound Phone Calls [R]

Goal: To save humans from wasting time sitting in call centre queues waiting to be answered, by having a tool listen in on the audio stream of a live call, post IVR navigation, to determine whether the call has transitioned out of the queue and to a live person. Requirements: The tool must…

20
arXiv — NLP / Computation & Language research 1mo ago

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

arXiv:2605.22012v1 Announce Type: new Abstract: Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is…

7
llama.cpp releases dev-tools 1mo ago

b9279

vulkan: fuse snake activation (mul, sin, sqr, mul, add) (#22855) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio…

23
