Tag: Inference
356 articles archived under #inference

r/LocalLLaMA community 16d ago
vLLM has a new streaming parser for Qwen3+ available in nightly
The new parser reportedly fixes the issues many were seeing with Qwen3.6-27b stopping mid-turn, as well as failing streaming tool calls due to chunk boundaries. The mid-turn stopping is especially annoying when trying to use the model for agentic workflows. I've not seen it… 22

NVIDIA Developer Blog official-blog 16d ago
Boosting MoE Training Throughput with Advanced Fusion Kernels
Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable... 36

Hacker News — AI on Front Page community 16d ago
Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g., tok/s)
Comments URL: https://news.ycombinator.com/item?id=48542100 Points: 510 # Comments: 255 23

r/LocalLLaMA community 16d ago
I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table
So every time I pick a model for a feature or a random use case I have, I end up having like 12 tabs open, usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or… 34

r/LocalLLaMA community 17d ago
This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b
"Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache)."
On the same hardware, generation speeds doubled and VRAM usage dropped significantly… 22

arXiv — Machine Learning research 17d ago
A Longitudinal Attribute-Conditioned Neural Network for Modeling Health-State Transition Probabilities in Temporally Irregular Data: The LANTERN Framework
arXiv:2606.13880v1 Announce Type: new Abstract: Accurate estimation of long-term care transition probabilities is central to disability insurance pricing, reserving, and solvency assessment. Classical actuarial multi-state models commonly rely on Markov, semi-Markov, or… 32

arXiv — Machine Learning research 17d ago
When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing
arXiv:2606.14668v1 Announce Type: new Abstract: Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a… 36

arXiv — Machine Learning research 17d ago
LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models
arXiv:2606.13709v1 Announce Type: cross Abstract: We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention… 23

r/LocalLLaMA community 17d ago
Voice-to-voice chatbot update
I've been working on this after hours for a few months, continuously improving it. Now at a point where the chatbot is close to real-time (thanks to SSE streaming) and also interruptible while preserving context of what was last said. 100% local and powered by Qwen3.5-397B… 33

r/LocalLLaMA community 17d ago
Qwen 27B Q6/Q8 KV + MTP at 256K on DGX Spark / GB10, tok/s?
Has anyone tested Qwen3.6-27B on NVIDIA DGX Spark / GB10 or similar systems at 256K context? I know it's a dense model, but I'm curious how it performs with MTP enabled.
Looking for real numbers with: Q6/Q8 quant, Q8 KV cache, MTP/speculative decoding, 256K context. Mainly… 31

r/LocalLLaMA community 17d ago
Xiaomi is now serving MiMo V2.5 at 1000-3000tps using DFlash & Persistent kernel. DFlash model is out, open-source release promised coming soon
https://mimo.xiaomi.com/blog/mimo-tilert-1000tps submitted by /u/Dany0 20

r/LocalLLaMA community 18d ago
Yay got Gemma 12B QAT working on old 1080ti (maybe with speculative decoding?)
Pretty happy with 50 tok/sec on this 9-year-old GPU. Suggestions to improve anything (speed or quality) very welcome! I'm not 100% sure how to tell if the speculative decoding "model-draft" is helping or not. But hey, it is fast and seems coherent, I'm happy… 24

r/LocalLLaMA community 18d ago
RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8
submitted by /u/SirReal14 22

r/LocalLLaMA community 18d ago
GLM 5.2 is out - open weights to be released next week. How did it do on my one-shot Pac-Man test?
Quick initial impressions:
- at 70 tok/s slower than GLM 5.1
- seems to spend more time reasoning
- better results with my Pac-Man test
The one-shot result is almost functional; apart from the ghosts getting stuck immediately after leaving the ghost house, I did not notice any… 14

Hacker News — AI on Front Page community 19d ago
RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8
Article URL: https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/
Comments URL: https://news.ycombinator.com/item?id=48515454 Points: 228 # Comments: 76 5

r/LocalLLaMA community 19d ago
4× RTX PRO 6000 Blackwell on Water, and the One Card That Wouldn't Behave
Converting four RTX PRO 6000 Blackwell cards to waterblocks, finding a VRM choke loose on the workbench, and getting back to 41k tok/s.
submitted by /u/thekalki 24

The Information — AI news-outlet 19d ago
Inside Tech’s Feverish Demand for Retatrutide, a Supposed Super Peptide
For more than a decade, Dr. Molly Maloof has had a front-row seat to Silicon Valley’s ever-evolving health obsessions as a physician and founder of M3 Healthspan, a San Francisco–based concierge medical practice serving the tech elite. Lately, those conversations have… 25

Hugging Face Daily Papers research 20d ago
From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion
Abstract: A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal image fusion… 33

arXiv — NLP / Computation & Language research 20d ago
MiniPIC: Flexible Position-Independent Caching in <100LOC
arXiv:2606.13126v1 Announce Type: cross Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV… 12

arXiv — NLP / Computation & Language research 20d ago
Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty
arXiv:2606.13452v1 Announce Type: cross Abstract: Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction.
Peer review, serving as the gatekeeper… 18

Hugging Face Daily Papers research 20d ago
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Abstract: ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use… 10

Hugging Face Daily Papers research 21d ago
Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency
Abstract: PACI enables efficient asynchronous pipeline training by controlling forward/backward weight inconsistency through local gradient accumulation, achieving higher throughput and faster training time-to-accuracy without sacrificing stability or memory usage. Generated by… 9

arXiv — Machine Learning research 21d ago
Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data
arXiv:2606.11272v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings, such as healthcare, industrial… 10

arXiv — Machine Learning research 21d ago
LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach
arXiv:2606.11463v1 Announce Type: new Abstract: Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend.
This white paper presents a… 30

arXiv — Machine Learning research 21d ago
Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification
arXiv:2606.11650v1 Announce Type: new Abstract: Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary… 4

arXiv — NLP / Computation & Language research 21d ago
External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs
arXiv:2606.11806v1 Announce Type: new Abstract: Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost… 30

Hugging Face Daily Papers research 21d ago
Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Abstract: Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.… 28

r/LocalLLaMA community 21d ago
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the… 25

NVIDIA Developer Blog official-blog 21d ago
Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation
Developers building real-time AI, such as chat assistants, copilots, and agentic workflows, are often constrained by token-by-token generation speed. This...
6

Hugging Face Daily Papers research 21d ago
FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion
Abstract: FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints. Generated by… 36

r/LocalLLaMA community 21d ago
Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?
Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output… 31

arXiv — Machine Learning research 22d ago
SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs
arXiv:2606.09868v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However,… 28

arXiv — Machine Learning research 22d ago
QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning
arXiv:2606.09869v1 Announce Type: new Abstract: Federated Learning (FL) combined with Split Learning (SL) is a privacy-preserving paradigm that enables training deep neural networks (DNNs) on resource-constrained devices while reducing overall training cost.
However, determining… 22

arXiv — NLP / Computation & Language research 22d ago
Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization
arXiv:2606.09927v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large… 6

arXiv — Machine Learning research 22d ago
Privacy-Preserving Credit Risk Prediction with Alternative Data
arXiv:2606.10333v1 Announce Type: new Abstract: Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data,… 14

arXiv — NLP / Computation & Language research 22d ago
ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval
arXiv:2606.10842v1 Announce Type: new Abstract: We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder… 31

arXiv — NLP / Computation & Language research 22d ago
Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the… 18

NVIDIA Developer Blog official-blog 22d ago
Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT
Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...
6

Hugging Face Daily Papers research 22d ago
AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents
Abstract: AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training… 32

r/MachineLearning community 22d ago
Are privacy-preserving techniques actually being used in production ML systems? [D]
I've been reading more about privacy-preserving ML approaches such as differential privacy, federated learning, and on-device inference. The research literature is fairly active, but I'm curious about real-world adoption. For those working in industry: Are these techniques being… 16

arXiv — Machine Learning research 23d ago
Enabling KV Caching of Shared Prefix for Diffusion Language Models
arXiv:2606.07571v1 Announce Type: new Abstract: Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means… 11

arXiv — Machine Learning research 23d ago
Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching
arXiv:2606.07684v1 Announce Type: new Abstract: Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token… 16

arXiv — Machine Learning research 23d ago
Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency
arXiv:2606.07881v1 Announce Type: new Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency.
Synchronous pipelines preserve forward/backward weight consistency but suffer… 5

Hugging Face Daily Papers research 23d ago
CoVEBench: Can Video Editing Models Handle Complex Instructions?
Abstract: A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by… 19

NVIDIA Developer Blog official-blog 23d ago
Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step... 34

r/MachineLearning community 23d ago
Université Paris Saclay or TU Delft for Applied Mathematics Masters [R]
I've been admitted into both UPS and TUD for Applied Mathematics, and I wanted to hear some advice on which one would be better. For context, I'd like to work in some form of AI research, most likely within industry. At the moment, I'm most interested in privacy preserving… 8

Hacker News — AI on Front Page community 23d ago
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second
Article URL: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
Comments URL: https://news.ycombinator.com/item?id=48446639 Points: 252 # Comments: 175 30

Hugging Face Daily Papers research 24d ago
Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
Abstract: PhaseLock is a training-free framework that improves physical consistency in image-to-video diffusion models by preserving motion priors from early-step inference throughout the denoising process.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Image-to-Video diffusion… 17

arXiv — Machine Learning research 24d ago
Accelerating Reproducible Research in Synthetic EHR Generation
arXiv:2606.06990v1 Announce Type: new Abstract: The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by… 13

arXiv — Machine Learning research 24d ago
Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging
arXiv:2606.07196v1 Announce Type: new Abstract: Classical sparse Type-II Bayesian methods for M/EEG brain imaging support joint estimation of source and noise hyperparameters, but rely on fixed iterative update rules. Although these updates are principled and interpretable,… 28

Page 3 of 8 · 356 articles