Tag

Inference

356 articles archived under #inference

r/MachineLearning community 1mo ago

I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama) [P]

Hey! I'm a CS student and I got tired of not being able to compare MLX inference engines properly — every benchmark out there is either made by the engine's own developers, runs on an M3 Ultra nobody has, or just shows tok/s with zero context. So I built mlx-Chronos — a small…

11
r/LocalLLaMA community 1mo ago

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

Under $1000 for 32gb vram from 2023, and ~300 watts draw... and this thing is outperforming the latest pick-your-vendor $5k mini pcs from 2026. So.. next question is can I make it squeeze 150 t/s with the same q4xl on cuda 13.3 this weekend. Anyone try it yet?

13
r/LocalLLaMA community 1mo ago

MINISFORUM UM790 Pro

Hi, has anyone tried this mini PC with llama.cpp or vLLM? This is what I have seen: "Budget and Compact Hardware MINISFORUM UM790 Pro ($351) is perhaps the most striking data point in the current local AI landscape." Is it true?

19
NVIDIA Developer Blog official-blog 1mo ago

DynoSim: Simulating the Pareto Frontier

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...

22
r/LocalLLaMA community 1mo ago

Step-3.7-Flash-NVFP4 thinking for many minutes

Anyone else seeing Step-3.7-Flash-NVFP4 thinking for many minutes? I'm using it with Cline and in some cases can see it thinking for 14 minutes, with vLLM reporting generation of 90 tokens/s every 10 s.

19
r/LocalLLaMA community 1mo ago

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Hey guys, I spent the last few weeks benchmarking Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B locally (GGUF, FP8) using both vLLM and llama.cpp. MTP is the inference trick every major lab is quietly adding to their stack right now and the results genuinely…

19
r/LocalLLaMA community 1mo ago

vLLM PR adding native HIP W4A16 kernel was merged

The performance increase introduced by the PR is awesome. Makes my ROCm rig a lot more useful. Numbers from the PR:

Kernel | dtype | max-num-seqs=8 | max-num-seqs=32
Triton W4A16 | bf16 | 82.4 tk/s | -
Triton W4A16 | fp16 | 83.2 tk/s | -
ExLlama (no bf16) | fp16 | 255.0 tk/s | 382.5 tk/s
RDNA3 W4A16…

27
r/LocalLLaMA community 1mo ago

Step 3.7 Flash Config + Early Data on 2x RTX 6000's

Set up Step 3.7 Flash on two Blackwell RTX Pro 6000s, got it running, and recorded the configs and settings as well as early data and readings like tokens per second on general inference. Running extended bench tests now; just wanted to get this to folks early. It's past…

21
arXiv — NLP / Computation & Language research 1mo ago

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

arXiv:2605.29000v1 Announce Type: new Abstract: Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study lossy semantic text compression, where the encoder strategically deletes…

27
arXiv — NLP / Computation & Language research 1mo ago

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

arXiv:2605.29379v1 Announce Type: new Abstract: We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's…

38
arXiv — NLP / Computation & Language research 1mo ago

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a…

31
Hugging Face Daily Papers research 1mo ago

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Abstract: Language models struggle with managing long-term information through contextual belief management, which involves updating, preserving, and filtering relevant information, and can be improved using reinforcement learning and representation-level steering techniques.…

14
r/LocalLLaMA community 1mo ago

Claude cli >= 2.1.154 breaks local use with vLLM by introducing "ctx", "msg" and "system" roles for API messages. This 1-line patch to vLLM fixes it.

diff --git a/vllm/entrypoints/anthropic/protocol.py b/vllm/entrypoints/anthropic/protocol.py
index 3ebc17117..2d5726d73 100644
--- a/vllm/entrypoints/anthropic/protocol.py
+++ b/vllm/entrypoints/anthropic/protocol.py
@@ -65,7 +65,7 @@ class AnthropicContentBlock(BaseModel):…

29
r/LocalLLaMA community 1mo ago

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do?

EDIT - IGNORE. I MADE A MISTAKE. The "better" model was 27b dense, not 35ba3b. Which also proves that 35b is not the best for coding-related tasks. With 27b fp8 on vLLM, the prefill speed is around 1500 tokens/sec and token gen is around 25 tokens/sec. I'll need to run llama again…

37
r/LocalLLaMA community 1mo ago

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

Used the vllm version of https://github.com/noonghunna/club-3090. It worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new llama.cpp patched one for a single 3090? The project is starting to seem very bloated, at least readme-wise. I use…

6
arXiv — Machine Learning research 1mo ago

Metric-Aware PCA as a Linear Instance of Geometric Deep Learning

arXiv:2605.27456v1 Announce Type: new Abstract: Geometric deep learning organises neural architectures around the symmetries of their data domain, with the choice of symmetry group serving as a geometric prior that determines what representations can be learned. Metric-Aware…

23
arXiv — Machine Learning research 1mo ago

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

arXiv:2605.27469v1 Announce Type: new Abstract: Continual Learning (CL) is a practical paradigm to utilize power of deep pre-trained neural networks, but which pre-trained model has a better ability to balance "Plasticity-Stability", deserving to be chosen? The logit shift…

35
arXiv — Machine Learning research 1mo ago

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized…

17
arXiv — Machine Learning research 1mo ago

SPAR: Support-Preserving Action Rectification

arXiv:2605.27877v1 Announce Type: new Abstract: Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions…

5
arXiv — Machine Learning research 1mo ago

RW-TTT: Batched Serving for Request-Owned Test-Time Training State

arXiv:2605.28053v1 Announce Type: new Abstract: Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static…

8
arXiv — Machine Learning research 1mo ago

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

arXiv:2605.28302v1 Announce Type: new Abstract: Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D)…

9
arXiv — NLP / Computation & Language research 1mo ago

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

arXiv:2605.28073v1 Announce Type: new Abstract: Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands…

15
arXiv — NLP / Computation & Language research 1mo ago

Why We Need Speech to Evaluate Speech Translation

arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and…

35
r/LocalLLaMA community 1mo ago

Vulnerability found in framework used by VLLM, many MCP servers, and other LLM tools

Worth taking a look to see if this affects any of you. Surprised nobody has posted it yet.

4
The Information — AI news-outlet 1mo ago

Crypto-Friendly United Texas Bank Switches Regulator to OCC

United Texas Bank, a crypto-friendly bank, said it successfully switched its regulator through a charter conversion, despite being under a consent order. The move will help it grow its business serving digital asset firms and foreign banks. The Dallas-based bank, which has about…

26
r/LocalLLaMA community 1mo ago

Is there any use case for large models with very slow token output for batch processing?

Maybe I'm influenced by the sci-fi story "The Last Question" by Isaac Asimov, but I've always got a tickle imagining a huge model like Kimi running on, say, disk. Even if it is 0.001 tok/sec to ask complex questions and get an answer in a week. Is there any use or community…

17
Hugging Face Daily Papers research 1mo ago

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Abstract: ZeroUnlearn addresses privacy concerns in large language models by reformulating machine unlearning as precise knowledge re-mapping through model editing, enabling efficient and targeted removal of sensitive information while preserving general model utility.…

38
Hugging Face Daily Papers research 1mo ago

Rethinking VLM Representation for VLA Initialization

Abstract: Effective vision-language-action model initialization requires balancing pretrained vision-language model representations with embodied task-specific adaptations and robot-data pretraining while preserving core action-relevant features.

22
Hugging Face Daily Papers research 1mo ago

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Abstract: Long-horizon agentic reasoning is enhanced through a state-adaptive memory framework that dynamically manages interaction histories by creating compact memory cues while preserving detailed trajectories for targeted retrieval. AI-generated summary: Long-horizon agentic…

20
arXiv — Machine Learning research 1mo ago

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

arXiv:2605.26243v1 Announce Type: new Abstract: Graph neural networks (GNNs) achieve strong performance on relational data, but real-world graphs are often distributed across organizations that cannot share raw data due to privacy and policy constraints. Existing federated GNN…

30
Hugging Face Daily Papers research 1mo ago

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Abstract: Parallel Box Decoding enables efficient and accurate unified visual grounding and detection by decoding geometric elements as atomic units, improving both throughput and localization quality. AI-generated summary: Vision-language models (VLMs) commonly formulate visual…

8
r/LocalLLaMA community 1mo ago

Fast little local memory retriever for Hermes

As title says. Looking for suggestions of a good memory retriever (for use with hindsight/hermes) ideally that can run on a strix halo NPU. GPT OSS 20B would be good based on their outdated rankings but it’s slow on the NPU for this type of task — needs very high throughput to…

4
r/LocalLLaMA community 1mo ago

Looking for Suggestions — Single 5090 & 64gb DDR5

Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if that’s the case what would yall do to…

10
r/LocalLLaMA community 1mo ago

Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends. For example, to run pi + vllm: # model…

26
Smol AI News news-outlet 1mo ago

not much happened today

Inference optimization is increasingly architectural, with EAGLE 3.1 improving speculative decoding and long-context handling, collaborating with vLLM and TorchSpec. Perplexity open-sourced a rebuilt Unigram tokenizer cutting CPU use by 5–6× and…

15
arXiv — Machine Learning research 1mo ago

PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

arXiv:2605.24249v1 Announce Type: new Abstract: The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information. Federated Learning (FL) offers a distributed alternative,…

19
arXiv — Machine Learning research 1mo ago

Hardware-Aware Federated Learning for Speech Emotion Recognition

arXiv:2605.24712v1 Announce Type: new Abstract: Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication…

16
arXiv — NLP / Computation & Language research 1mo ago

DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

arXiv:2605.24885v1 Announce Type: new Abstract: Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While…

30
r/LocalLLaMA community 1mo ago

Qwen 3.6 benchmarks on 2x RTX PRO 6000

Got a chance to play around with a 2x RTX PRO 6000 setup, so sharing some numbers for Qwen 3.6. All of these were run using the latest stable vLLM backend. This was for a personal project.

Qwen 3.6 27B BF16 (original, without any quantization)
MTP - Off | 64 concurrency | 1600 tps…

8
arXiv — Machine Learning research 1mo ago

FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning

arXiv:2605.22898v1 Announce Type: new Abstract: Federated learning protocols face a structural trilemma: canonical server-based aggregation [mcmahan2017] creates a single point of failure and gradient inversion risk; decentralised ring-gossip…

23
arXiv — Machine Learning research 1mo ago

Building a privacy-preserving Federated Recommender system for mobile devices

arXiv:2605.22924v1 Announce Type: new Abstract: Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present…

25
arXiv — Machine Learning research 1mo ago

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

arXiv:2605.23057v1 Announce Type: new Abstract: ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving…

15
arXiv — NLP / Computation & Language research 1mo ago

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

arXiv:2605.23605v1 Announce Type: cross Abstract: Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked…

18
r/LocalLLaMA community 1mo ago

Could someone please help explain these results?

I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled! (17 to 34 tok/s). Shouldn't it have slowed down from the CPU having to do so much more…

22
r/LocalLLaMA community 1mo ago

How are you all handling agents and sub agents?

Currently got it set up in LibreChat to use DeepSeek v4 pro via OpenRouter as the master planner, then have my PC running Qwen 35B @ 160ish tok/sec locally, and my mini PC running Gemma E2B locally for smaller tasks. I'm wondering if there are setups out there to effectively…

10
r/LocalLLaMA community 1mo ago

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fit into my RTX 5060 Ti 16 GB. Inspired by u/Due-Project-7507's work on Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF. Using the same pure quantization method, I was able to create a Q4_K_M…

19
llama.cpp releases dev-tools 1mo ago

b9291

SYCL: improve MoE prefill throughput (#23142). Change k_copy_src1_to_contiguous so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with known start and end; switch the O(n_as * n_routed_rows) contraption to a counting sort-based…

27
r/LocalLLaMA community 1mo ago

Cannot get NCCL test to run in docker with 2 x 6000 Pro connected x8 to AM4 CPU

nvidia-smi topo -m shows both GPUs connected as PHB (i.e. via the CPU), as expected, but I cannot get NCCL all_reduce_perf to run at all; it always hangs after starting up. It seems that vllm won't work with TP=2 until I can fix this. Is there any reason why this setup would…

5
arXiv — NLP / Computation & Language research 1mo ago

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

arXiv:2605.22035v1 Announce Type: cross Abstract: Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This…

33
Hugging Face Daily Papers research 1mo ago

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Abstract: KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.…

30
