Tag

Inference

358 articles archived under #inference

arXiv — NLP / Computation & Language research 1mo ago

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

arXiv:2605.22035v1 Announce Type: cross Abstract: Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This…

33
Hugging Face Daily Papers research 1mo ago

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Abstract: KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.…

30
Hugging Face Daily Papers research 1mo ago

WorldKV: Efficient World Memory with World Retrieval and Compression

Abstract: WorldKV enables persistent world generation in video diffusion models by retrieving and compressing key-value cache chunks to maintain consistency while improving throughput. AI-generated summary: Autoregressive video diffusion models have enabled real-time,…

22
Hugging Face Daily Papers research 1mo ago

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Abstract: Research investigates subword tokenization's impact on LLM training efficiency and performance through controlled byte-level pretraining experiments, revealing key factors in training throughput and linguistic priors. AI-generated summary: Subword tokenization is an…

23
r/LocalLLaMA community 1mo ago

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP. So, I decided to try out ik_llama.cpp since it also supports MTP and is apparently better optimized for…

35
r/LocalLLaMA community 1mo ago

'Am I OpenAI compatible' - a tool and documentation for unified api signatures in open source AI.

This has turned out to be useful to many of my friends, so I thought I'd share it here as well. I created a tool and documentation page for most major open-source projects' adherence to 'OpenAI compatibility' after seeing inconsistencies between engines like vLLM and llama.cpp. Now…

18
r/MachineLearning community 1mo ago

High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]

Recently fine-tuned a Gemma 4 26B model, and I’m seeing surprisingly high end-to-end latency despite the effective inference footprint being much smaller (~4B-ish behavior during serving). Current setup: Model: Gemma 4 26B (fine-tuned); Engine: vLLM; Quantization: FP8; Hardware:…

27
Hugging Face Daily Papers research 1mo ago

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract: Mix-Quant is a phase-aware quantization framework that accelerates long-context, multi-turn LLM inference by applying high-throughput NVFP4 quantization to the prefilling phase while maintaining BF16 precision for decoding. AI-generated summary: LLM agents have recently…

30
Hugging Face Daily Papers research 1mo ago

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract: Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection. AI-generated summary: Safety post-training can…

32
arXiv — Machine Learning research 1mo ago

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

arXiv:2605.20247v1 Announce Type: new Abstract: Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing…

38
arXiv — Machine Learning research 1mo ago

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

arXiv:2605.20262v1 Announce Type: new Abstract: We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed…

7
arXiv — Machine Learning research 1mo ago

Consistently Informative Soft-Label Temperature for Knowledge Distillation

arXiv:2605.20357v1 Announce Type: new Abstract: Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions…

28
arXiv — NLP / Computation & Language research 1mo ago

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

arXiv:2605.20915v1 Announce Type: new Abstract: Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation.…

36
arXiv — NLP / Computation & Language research 1mo ago

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

arXiv:2605.20936v1 Announce Type: cross Abstract: Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often…

27
r/LocalLLaMA community 1mo ago

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g \ -v…

13
r/LocalLLaMA community 1mo ago

Try ik_llama.cpp with MTP if you have limited VRAM. You will be pleasantly surprised!

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB (75-80 tok/s), until they actually merged the MTP PR. Then, performance tanked (65-70 tok/s) and was barely above non-MTP. I then decided to try out ik_llama.cpp since it also supports MTP. I did not…

14
Hugging Face Daily Papers research 1mo ago

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Abstract: POW3R is a policy-aware framework for reinforcement learning with rubric-based rewards that adapts criterion weights during training to improve policy optimization while preserving human-defined criteria importance. AI-generated summary: Reinforcement learning with…

12
Hugging Face Daily Papers research 1mo ago

Base Models Look Human To AI Detectors

Abstract: Instruction-tuned language models produce text that commercial detectors identify as non-human, prompting the development of a paraphrasing pipeline that improves human-likeness while preserving semantics across different model sizes. AI-generated summary: As…

37
Simon Willison community 1mo ago

How fast is 10 tokens per second really?

Neat little HTML app by Mike Veerman (source code here) which simulates LLM token output speeds from 5/second to 800/second. Useful if you see a model advertised as "30 tokens/second" and want to get a feel for what that actually looks…

4
r/LocalLLaMA community 1mo ago

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me. TL;DR: 35B Q4_K_XL, no MTP,…

38
arXiv — Machine Learning research 1mo ago

Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

arXiv:2605.18825v1 Announce Type: new Abstract: Prefix caching is a key optimization in Large Language Model (LLM) serving, reusing attention Key-Value (KV) states across requests with shared prompt prefixes to reduce expensive prefill computation. However, its benefit depends…

5
arXiv — Machine Learning research 1mo ago

Towards Family-Grouped Hierarchical Federated Learning on Sub-5KB Models: A Feasibility Study of Privacy-Preserving ECG Monitoring for Ultra-Resource-Constrained Wearables

arXiv:2605.18862v1 Announce Type: new Abstract: Cardiovascular disease remains the leading cause of death worldwide, and early detection of arrhythmias through continuous ECG monitoring on wearable devices can prevent life-threatening events. Federated Learning (FL) enables…

26
arXiv — Machine Learning research 1mo ago

Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

arXiv:2605.18899v1 Announce Type: new Abstract: Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving…

25
arXiv — Machine Learning research 1mo ago

KVBuffer: IO-aware Serving for Linear Attention

arXiv:2605.19049v1 Announce Type: new Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by…

28
arXiv — NLP / Computation & Language research 1mo ago

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

arXiv:2605.19723v1 Announce Type: new Abstract: Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning…

19
r/LocalLLaMA community 1mo ago

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Hey r/DeepSeek, who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs…

29
arXiv — Machine Learning research 1mo ago

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

arXiv:2605.16438v1 Announce Type: new Abstract: Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients…

23
arXiv — Machine Learning research 1mo ago

Wavelet Flow Matching for Multi-Scale Physics Emulation

arXiv:2605.16573v1 Announce Type: new Abstract: Accurate emulation of multi-scale physical systems governed by PDEs demands models that remain stable over long autoregressive rollouts while preserving fine-scale structures. Deterministic emulators produce overly-smoothed…

5
arXiv — NLP / Computation & Language research 1mo ago

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

arXiv:2605.16839v1 Announce Type: new Abstract: Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed…

31
arXiv — NLP / Computation & Language research 1mo ago

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

arXiv:2605.16882v1 Announce Type: new Abstract: Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for…

4
arXiv — NLP / Computation & Language research 1mo ago

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

arXiv:2605.17672v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing…

8
r/MachineLearning community 1mo ago

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads. The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.…

13
r/LocalLLaMA community 1mo ago

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs. Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs: Strix Halo (Framework Desktop, ROCm 7.0.2): Q4_K_M: 11.7 → 21.2 tok/s (1.81×) Q8_0: 7.4…

31
r/LocalLLaMA community 1mo ago

Configuration Qwen3.6-35b-a3b (12Gb VRAM)

Has anyone here tested different KV cache quantizations and compared their performance? I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total…

38
r/LocalLLaMA community 1mo ago

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp was a good start, BeeLlama worth…

17
Hugging Face Daily Papers research 1mo ago

PhysBrain 1.0 Technical Report

Abstract: PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation. AI-generated summary…

28
arXiv — Machine Learning research 1mo ago

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

arXiv:2605.15393v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply…

11
arXiv — NLP / Computation & Language research 1mo ago

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

arXiv:2605.15794v1 Announce Type: new Abstract: We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural…

19
Hacker News — AI on Front Page community 1mo ago

How fast is N tokens per second really?

Article URL: https://mikeveerman.github.io/tokenspeed/ Comments URL: https://news.ycombinator.com/item?id=48174920 Points: 200 # Comments: 52

21
r/LocalLLaMA community 1mo ago

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and…

4
r/LocalLLaMA community 1mo ago

MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware

As usual, disclosure first: I'm on the team that built this. Our MiroThinker-1.7-deepresearch and 1.7-mini-deepresearch API went live, mini is a deep research agent built on Qwen3 MoE (30B total, 3B active for mini). Weights on HuggingFace:…

14
r/LocalLLaMA community 1mo ago

Using Intel Arc Pro series, any thoughts ?

Simple question: has anyone run two or more of either of these on Ubuntu? Intel Arc Pro B70 (32 GB), Intel Arc Pro B65 (32 GB). Running llama or vLLM etc. Any thoughts? Submitted by /u/BikerBoyRoy123

13
r/LocalLLaMA community 1mo ago

Can a 5090 with qwen3.6 achieve > 3,000 tok/s ? bring your pitchforks (open-dllm)

So, background: these people, Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong, worked on converting AR -> diffusion (it's already working from older models). https://oval-shell-31c.notion.site/Open-dLLM-Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a I forked…

23
Hugging Face Daily Papers research 1mo ago

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Abstract: Geodesic flow matching improves image generation by projecting latents onto fixed radius spheres and using spherical linear interpolation instead of linear paths, preserving semantic content through angular components. AI-generated summary: Latent flow matching for image…

26
r/LocalLLaMA community 1mo ago

is there a centralized website for llm launch commands?

I keep finding myself scrounging wikis and whatnot for everyone's serving commands. Is there a site where users could contribute their commands, hardware, runtime, and whatnot? Submitted by /u/onephn

33
r/LocalLLaMA community 1mo ago

Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall for STT, Piper for TTS with…

21
r/LocalLLaMA community 1mo ago

Important (vision) Qwen3.5 template fix dropped in vllm

Sharing this because I personally had some annoying issues and I can confirm this un-fucked them. Basically, once you posted an image in the conversation the model went haywire. Not too badly, but annoying. Submitted by /u/Dany0

14
r/LocalLLaMA community 1mo ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context…

28
arXiv — Machine Learning research 1mo ago

PreFT: Prefill-only finetuning for efficient inference

arXiv:2605.14217v1 Announce Type: new Abstract: Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management…

32
arXiv — Machine Learning research 1mo ago

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

arXiv:2605.14289v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to…

36
