Tag

Inference

356 articles archived under #inference

arXiv — Machine Learning research 24d ago

Closed-Form Spectral Regularization for Multi-Task Model Merging

arXiv:2606.07289v1 Announce Type: new Abstract: Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models.…

38
arXiv — Machine Learning research 24d ago

Breaking the Ice: Analyzing Cold Start Latency in vLLM

arXiv:2606.07362v1 Announce Type: new Abstract: As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular,…

11
arXiv — Machine Learning research 24d ago

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

arXiv:2606.07404v1 Announce Type: new Abstract: This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense…

28
arXiv — NLP / Computation & Language research 24d ago

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

arXiv:2606.07240v1 Announce Type: new Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026…

21
arXiv — NLP / Computation & Language research 24d ago

MMAE: A Massive Multitask Audio Editing Benchmark

arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,…

8
arXiv — NLP / Computation & Language research 24d ago

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

arXiv:2606.07356v1 Announce Type: cross Abstract: Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free…

25
Hugging Face Daily Papers research 24d ago

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Abstract SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships. Generated by…

33
r/LocalLLaMA community 25d ago

dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model

I'm into both HPC and 3D reconstruction, so I built this as a side project. dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda) + cuTLASS (header-only lib) - mmap'd…

21
r/LocalLLaMA community 25d ago

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the…

17
r/LocalLLaMA community 25d ago

It felt good to return my Asus Spark

It's an incredible little package, but too high a price to pay for the performance, and I simply didn't want to be part of the great "Superchip lie" - it could be super, but it's super ruined by its limited memory bandwidth even though it *could* be 2x throughput - it…

31
r/LocalLLaMA community 25d ago

Serving TTS/cloning models on llama.cpp?

Are there any quality voice cloning and speech generation models that already have support in llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather than making a separate container or conda for each model I…

17
r/LocalLLaMA community 26d ago

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Up to 5.8x throughput speedup on Qwen3. Paper: https://arxiv.org/abs/2605.29707 Code: https://github.com/jianuo-huang/Domino Models: https://huggingface.co/Huang2020

6
Hugging Face Daily Papers research 26d ago

LLM Anonymization Against Agentic Re-Identification

Abstract AURA is an LLM-powered anonymization framework that balances privacy protection against agentic web-search re-identification while preserving contextual utility through adaptive privacy scopes and mask-reconstruct methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

25
r/LocalLLaMA community 26d ago

Qwen 3.6-27B on vLLM with dual RTX 3090s: looking for launch parameters

Hi everyone. Please share your working launch commands for running Qwen 3.6-27B via vLLM on dual RTX 3090s (both running in PCIe 4.0 x8). I'm interested in setups both with and without an NVLink bridge. I'm familiar with the club-3090 repo, but their ready-to-use vLLM recipes…

8
r/LocalLLaMA community 26d ago

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!

Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM…

12
llama.cpp releases dev-tools 27d ago

b9521

CUDA: enroll mul_mat_vec_q_moe into pdl ( #24087 ) Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8…

10
arXiv — Machine Learning research 27d ago

DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum

arXiv:2606.05435v1 Announce Type: new Abstract: Differentially private stochastic gradient descent (DP-SGD) has become the standard framework for privacy-preserving machine learning, yet its reliance on a fixed gradient clipping threshold to limit sensitivity remains a…

12
arXiv — NLP / Computation & Language research 27d ago

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

arXiv:2606.05561v1 Announce Type: new Abstract: Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve…

34
r/LocalLLaMA community 27d ago

Finally finished my LLM server: EPYC 9575F, 4× RTX 3090 (96GB VRAM), 768GB ECC RAM

Took a while, but Nalthis is finally up and assembled. Specs: Supermicro H13SSL-N AMD EPYC 9575F (64C/128T Zen 5) 768GB DDR5-5600 ECC RDIMM 4× RTX 3090 (96GB VRAM total) 1× 2TB NVMe OS 2× 3.94TB NVMe data 2050W ATX 3.1 PSU Corsair 9000D Planned use: vLLM - high throughput small…

11
r/LocalLLaMA community 27d ago

I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards. My setup (Node #04) Gigabyte X399 Designare EX Threadripper 1950X 128GB DDR4 4x RTX 3090 10GbE TP-Link/Aquantia NIC llama.cpp NCCL build vLLM for safetensors models I…

15
r/LocalLLaMA community 27d ago

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

The KV-cache quant race just got more interesting. Huawei just open-sourced KVarN, a KV-cache quantization method under Apache 2.0 that drops into vLLM with one flag. Posting because the tradeoff it's claiming is genuinely different from what's already in the stack, and I'd like to…

20
Hugging Face Daily Papers research 28d ago

Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning

Abstract DeepMDMD combines deep learning with Koopman theory to learn latent coordinates while enforcing algebraic constraints, enabling stable forecasting and coherent structure preservation in complex dynamical systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Koopman…

32
r/LocalLLaMA community 28d ago

MTP has no impact on my Qwen3.6 MoE performance

Hello I have an rtx 5060Ti and I tried running unsloth's Qwen3.6-35B GGUF with MTP. However in both cases I have around 60 tok/s. Here are my flags: llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --alias unsloth/Qwen3.6…

35
arXiv — Machine Learning research 28d ago

Bayes-Sufficient Representations in Supervised Learning

arXiv:2606.04045v1 Announce Type: new Abstract: Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what relevance means for a fixed supervised decision problem. A representation is defined to be…

14
arXiv — Machine Learning research 28d ago

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

arXiv:2606.04238v1 Announce Type: new Abstract: Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for…

22
arXiv — Machine Learning research 28d ago

Federated Learning for Multi-Center Sepsis Early Prediction with Privacy-Preserving

arXiv:2606.04338v1 Announce Type: new Abstract: Privacy-sensitive and distributed characteristics of multi-center medical data bring severe obstacles to centralized modeling for accurate early prediction of sepsis. Federated learning (FL) has attracted growing attention as a…

6
arXiv — Machine Learning research 28d ago

Revisiting Privacy Amplification by Subsampling in Selective Release DPSGD

arXiv:2606.04384v1 Announce Type: new Abstract: Machine learning's reliance on sensitive data necessitates privacy-preserving techniques like Differentially Private Stochastic Gradient Descent (DPSGD). However, DPSGD suffers from substantial utility degradation and slow…

28
arXiv — NLP / Computation & Language research 28d ago

SANE Schema-aware Natural-language Evaluation of Biological Data

arXiv:2606.04500v1 Announce Type: new Abstract: High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a…

23
arXiv — NLP / Computation & Language research 28d ago

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

arXiv:2606.04646v1 Announce Type: new Abstract: Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized…

6
Hugging Face Daily Papers research 28d ago

KletterMix: Climbing Toward High-Quality German Pretraining Data

Abstract A high-quality German-language corpus for language model pretraining is introduced through careful translation of an English corpus while preserving document structure and metadata, demonstrating improved downstream performance in German-language tasks. Generated by…

28
r/LocalLLaMA community 29d ago

Another shout out to llama.cpp build b9455 2x3090

As you guys know, the next highest quant is Unsloth's /Qwen3.6-27B-UD-Q8_K_XL.gguf. With llama.cpp before, I was getting 30-50 tk/s. vLLM was kicking llama's ass…

4
arXiv — Machine Learning research 29d ago

Geometry-Aware Tabular Diffusion

arXiv:2606.02607v1 Announce Type: new Abstract: Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which…

32
arXiv — Machine Learning research 29d ago

Fast Unlearning at Scale via Margin Self-Correction

arXiv:2606.02920v1 Announce Type: new Abstract: Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model…

26
arXiv — Machine Learning research 29d ago

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

arXiv:2606.03070v1 Announce Type: new Abstract: Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected…

17
arXiv — NLP / Computation & Language research 29d ago

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

arXiv:2606.03399v1 Announce Type: new Abstract: While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy…

4
arXiv — Machine Learning research 1mo ago

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

arXiv:2606.00132v1 Announce Type: new Abstract: While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting aware methods typically seek safer updates through…

8
arXiv — Machine Learning research 1mo ago

Multi-Objective Reference-Aligned Machine Unlearning

arXiv:2606.00399v1 Announce Type: new Abstract: Machine unlearning aims to remove the influence of specific training samples while preserving the model's utility. Existing single-objective approaches, such as gradient ascent or random relabeling, often induce catastrophic…

28
arXiv — Machine Learning research 1mo ago

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

arXiv:2606.00437v1 Announce Type: new Abstract: Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations…

19
arXiv — NLP / Computation & Language research 1mo ago

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

arXiv:2606.00356v1 Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these…

4
arXiv — NLP / Computation & Language research 1mo ago

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

arXiv:2606.00724v1 Announce Type: new Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in…

28
arXiv — NLP / Computation & Language research 1mo ago

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

arXiv:2606.01240v1 Announce Type: new Abstract: The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). The RAG system assists LLM generation by retrieving…

36
Hugging Face Daily Papers research 1mo ago

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Abstract VideoMLA reduces memory usage in video diffusion models by replacing per-head keys and values with shared low-rank content and decoupled 3D-RoPE positional keys, maintaining quality while achieving significant compression and improved throughput. AI-generated summary…

19
llama.cpp releases dev-tools 1mo ago

b9460

llama: limit max outputs of llama_context ( #23861 ) llama: save more VRAM by reserving n_outputs == n_seqs when possible add n_outputs_per_seq move n_outputs_max to server-context change ubatch to batch everywhere macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…

15
r/LocalLLaMA community 1mo ago

For Ling-2.6-1T, what would make the size feel justified first: quality per token, local serving reality, or long context stability?

The first question I have about Ling-2.6-1T is not “is the model card impressive?” It is whether the boring trade-off makes sense. It is an open-sourced Ant/InclusionAI flagship with about 1T total params / 63B activated params, up to 1M native context, and 256K currently…

21
r/LocalLLaMA community 1mo ago

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:…

24
arXiv — Machine Learning research 1mo ago

ScaleMAP: Preserving Local Density and Neighborhood Structure in Low-Dimensional Embeddings

arXiv:2605.30597v1 Announce Type: new Abstract: Nonlinear dimensionality-reduction methods such as UMAP and PaCMAP adaptively normalize local distances during graph construction, erasing neighborhood scale from the data. This distorts more than relative cluster sizes: sparse…

9
arXiv — Machine Learning research 1mo ago

The Fast Mixing Mechanism for Differential Privacy

arXiv:2605.30600v1 Announce Type: new Abstract: Randomized sketching is a central tool for compressing large-scale optimization problems while preserving accuracy. In particular, sketches that are based on structured matrices, such as the Hadamard matrix, can be applied…

28
arXiv — Machine Learning research 1mo ago

Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints

arXiv:2605.30825v1 Announce Type: new Abstract: Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained models -- two fundamentally conflicting objectives. We propose a principled constrained optimization framework…

21
arXiv — Machine Learning research 1mo ago

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

arXiv:2605.30873v1 Announce Type: new Abstract: Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user…

35
arXiv — Machine Learning research 1mo ago

An Efficient and Scalable Graph Condensation with Structure-Preserving

arXiv:2605.31016v1 Announce Type: new Abstract: Graph condensation (GC) is pivotal for enabling Graph Neural Networks (GNNs) deployment in resource-constrained scenarios by compressing large-scale graphs into compact synthetic counterparts. Existing GC methods commonly suffer…

21
