Tag

Inference

356 articles archived under #inference

r/LocalLLaMA community 16d ago

vLLM has a new streaming parser for Qwen3+ available in nightly

The new parser reportedly fixes the issues many were seeing with Qwen3.6-27b stopping mid turn, as well as failing streaming tool calls due to chunk boundaries. The mid turn stopping is especially annoying when trying to use the model for agentic workflows. I've not seen it…

22
NVIDIA Developer Blog official-blog 16d ago

Boosting MoE Training Throughput with Advanced Fusion Kernels

Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable...

36
Hacker News — AI on Front Page community 16d ago

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g., tok/s). Comments URL: https://news.ycombinator.com/item?id=48542100 Points: 510 # Comments: 255

23
r/LocalLLaMA community 16d ago

I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table

So every time I pick a model for a feature or random use-case I have, I end up having like 12 tabs open: usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or…

34
r/LocalLLaMA community 17d ago

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b

"Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache)." On the same hardware, generation speeds doubled and VRAM usage dropped significantly…

22
arXiv — Machine Learning research 17d ago

A Longitudinal Attribute-Conditioned Neural Network for Modeling Health-State Transition Probabilities in Temporally Irregular Data: The LANTERN Framework

arXiv:2606.13880v1 Announce Type: new Abstract: Accurate estimation of long-term care transition probabilities is central to disability insurance pricing, reserving, and solvency assessment. Classical actuarial multi-state models commonly rely on Markov, semi-Markov, or…

32
arXiv — Machine Learning research 17d ago

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

arXiv:2606.14668v1 Announce Type: new Abstract: Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a…

36
arXiv — Machine Learning research 17d ago

LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

arXiv:2606.13709v1 Announce Type: cross Abstract: We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention…

23
r/LocalLLaMA community 17d ago

Voice-to-voice chatbot update

I've been working on this after hours for a few months continuously improving it. Now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B…

33
r/LocalLLaMA community 17d ago

Qwen 27B Q6/Q8 KV + MTP at 256K on DGX Spark / GB10, tok/s?

Has anyone tested Qwen3.6-27B on NVIDIA DGX Spark / GB10 or similar systems at 256K context? I know it's a dense model, but I'm curious how it performs with MTP enabled. Looking for real numbers with: Q6/Q8 quant, Q8 KV cache, MTP/speculative decoding, 256K context. Mainly…

31
r/LocalLLaMA community 17d ago

Xiaomi is now serving MiMo V2.5 at 1000-3000tps using DFlash & Persistent kernel. DFLash model is out, open-source release promised coming soon

https://mimo.xiaomi.com/blog/mimo-tilert-1000tps submitted by /u/Dany0

20
r/LocalLLaMA community 18d ago

Yay got Gemma 12B QAT working on old 1080ti (maybe with speculative decoding?)

Pretty happy with 50 tok/sec on this 9-year-old GPU. Suggestions to improve anything (speed or quality) very welcome! I'm not 100% sure how to tell if the speculative decoding "model-draft" is helping or not. But hey, it is fast and seems coherent, I'm happy…

24
r/LocalLLaMA community 18d ago

RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8

submitted by /u/SirReal14

22
r/LocalLLaMA community 18d ago

GLM 5.2 is out - open weights to be released next week. How did it do on my one-shot Pac-Man test?

Quick initial impressions: at 70 tok/s, slower than GLM 5.1; seems to spend more time reasoning; better results with my Pac-Man test. The one-shot result is almost functional; apart from the ghosts getting stuck immediately after leaving the ghost house, I did not notice any…

14
Hacker News — AI on Front Page community 19d ago

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Article URL: https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/ Comments URL: https://news.ycombinator.com/item?id=48515454 Points: 228 # Comments: 76

5
r/LocalLLaMA community 19d ago

4× RTX PRO 6000 Blackwell on Water, and the One Card That Wouldn't Behave

Converting four RTX PRO 6000 Blackwell cards to waterblocks, finding a VRM choke loose on the workbench, and getting back to 41k tok/s. submitted by /u/thekalki

24
The Information — AI news-outlet 19d ago

Inside Tech’s Feverish Demand for Retatrutide, a Supposed Super Peptide

For more than a decade, Dr. Molly Maloof has had a front-row seat to Silicon Valley’s ever-evolving health obsessions as a physician and founder of M3 Healthspan, a San Francisco–based concierge medical practice serving the tech elite. Lately, those conversations have…

25
Hugging Face Daily Papers research 20d ago

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Abstract A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal image fusion…

33
arXiv — NLP / Computation & Language research 20d ago

MiniPIC: Flexible Position-Independent Caching in <100LOC

arXiv:2606.13126v1 Announce Type: cross Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV…

12
arXiv — NLP / Computation & Language research 20d ago

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

arXiv:2606.13452v1 Announce Type: cross Abstract: Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper…

18
Hugging Face Daily Papers research 20d ago

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use…

10
Hugging Face Daily Papers research 21d ago

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

Abstract PACI enables efficient asynchronous pipeline training by controlling forward/backward weight inconsistency through local gradient accumulation, achieving higher throughput and faster training time-to-accuracy without sacrificing stability or memory usage. Generated by…

9
arXiv — Machine Learning research 21d ago

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

arXiv:2606.11272v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings, such as healthcare, industrial…

10
arXiv — Machine Learning research 21d ago

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

arXiv:2606.11463v1 Announce Type: new Abstract: Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a…

30
arXiv — Machine Learning research 21d ago

Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

arXiv:2606.11650v1 Announce Type: new Abstract: Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary…

4
arXiv — NLP / Computation & Language research 21d ago

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

arXiv:2606.11806v1 Announce Type: new Abstract: Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost…

30
Hugging Face Daily Papers research 21d ago

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Abstract Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.…

28
r/LocalLLaMA community 21d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the…

25
NVIDIA Developer Blog official-blog 21d ago

Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation

Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This...

6
Hugging Face Daily Papers research 21d ago

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints. Generated by…

36
r/LocalLLaMA community 21d ago

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?

Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output…

31
arXiv — Machine Learning research 22d ago

SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs

arXiv:2606.09868v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However,…

28
arXiv — Machine Learning research 22d ago

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

arXiv:2606.09869v1 Announce Type: new Abstract: Federated Learning (FL) combined with Split Learning (SL) is a privacy-preserving paradigm that enables training deep neural networks (DNNs) on resource-constrained devices while reducing overall training cost. However, determining…

22
arXiv — NLP / Computation & Language research 22d ago

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

arXiv:2606.09927v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large…

6
arXiv — Machine Learning research 22d ago

Privacy-Preserving Credit Risk Prediction with Alternative Data

arXiv:2606.10333v1 Announce Type: new Abstract: Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data,…

14
arXiv — NLP / Computation & Language research 22d ago

ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

arXiv:2606.10842v1 Announce Type: new Abstract: We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder…

31
arXiv — NLP / Computation & Language research 22d ago

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the…

18
NVIDIA Developer Blog official-blog 22d ago

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...

6
Hugging Face Daily Papers research 22d ago

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Abstract AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training…

32
r/MachineLearning community 22d ago

Are privacy-preserving techniques actually being used in production ML systems? [D]

I've been reading more about privacy-preserving ML approaches such as differential privacy, federated learning, and on-device inference. The research literature is fairly active, but I'm curious about real-world adoption. For those working in industry: Are these techniques being…

16
arXiv — Machine Learning research 23d ago

Enabling KV Caching of Shared Prefix for Diffusion Language Models

arXiv:2606.07571v1 Announce Type: new Abstract: Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means…

11
arXiv — Machine Learning research 23d ago

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

arXiv:2606.07684v1 Announce Type: new Abstract: Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token…

16
arXiv — Machine Learning research 23d ago

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

arXiv:2606.07881v1 Announce Type: new Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer…

5
Hugging Face Daily Papers research 23d ago

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by…

19
NVIDIA Developer Blog official-blog 23d ago

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step...

34
r/MachineLearning community 23d ago

Université Paris Saclay or TU Delft for Applied Mathematics Masters [R]

I've been admitted into both UPS and TUD for Applied Mathematics, and I wanted to hear some advice on which one would be better. For context, I'd like to work in some form of AI research, most likely within industry. At the moment, I'm most interested in privacy preserving…

8
Hacker News — AI on Front Page community 23d ago

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

Article URL: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps Comments URL: https://news.ycombinator.com/item?id=48446639 Points: 252 # Comments: 175

30
Hugging Face Daily Papers research 24d ago

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Abstract PhaseLock is a training-free framework that improves physical consistency in image-to-video diffusion models by preserving motion priors from early-step inference throughout the denoising process. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Image-to-Video diffusion…

17
arXiv — Machine Learning research 24d ago

Accelerating Reproducible Research in Synthetic EHR Generation

arXiv:2606.06990v1 Announce Type: new Abstract: The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by…

13
arXiv — Machine Learning research 24d ago

Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging

arXiv:2606.07196v1 Announce Type: new Abstract: Classical sparse Type-II Bayesian methods for M/EEG brain imaging support joint estimation of source and noise hyperparameters, but rely on fixed iterative update rules. Although these updates are principled and interpretable,…

28
