Tag: Inference
358 articles archived under #inference

arXiv — Machine Learning · research · 1mo ago
MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data
arXiv:2605.14364v1 Announce Type: new Abstract: Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal… 13

arXiv — NLP / Computation & Language · research · 1mo ago
GradShield: Alignment Preserving Finetuning
arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a… 23

Hugging Face Daily Papers · research · 1mo ago
Topology-Preserving Neural Operator Learning via Hodge Decomposition
Abstract: Physical field equations on geometric meshes are analyzed through Hodge theory to develop a hybrid Eulerian-Lagrangian architecture that improves accuracy and efficiency by separating topological and geometric components. AI-generated summary: In this paper, we study… 29

Vercel — AI · dev-tools · 1mo ago
Sort providers by cost, latency, or throughput on AI Gateway
You can now sort the providers behind a model by cost, time to first token (TTFT), or throughput (TPS) in AI Gateway. The default provider order blends provider reliability, quality of model output, cost, and speed of response. You can now use sort for explicit control over… 35

vLLM releases · dev-tools · 1mo ago
v0.21.0
Highlights: This release features 367 commits from 202 contributors (49 new)! Transformers v4 deprecated: This release formally deprecates transformers v4 support (#40389). Users should migrate to transformers v5. C++20 build requirement: vLLM now requires a C++20-compatible… 23

r/LocalLLaMA · community · 1mo ago
A First Comprehensive Study of TurboQuant: Accuracy and Performance
TL;DR from the article: FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving… 27

r/LocalLLaMA · community · 1mo ago
Is there a big gap between Q4 and Q6 on Qwen3.6?
I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high. Maybe 65k or up to 100k. I’ve thrown around the idea of a second 3090. But I do already have some… 28

arXiv — Machine Learning · research · 1mo ago
Inference-Time Machine Unlearning via Gated Activation Redirection
arXiv:2605.12765v1 Announce Type: new Abstract: Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model… 10

arXiv — Machine Learning · research · 1mo ago
Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle
arXiv:2605.13021v1 Announce Type: new Abstract: Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most… 28

Hugging Face Daily Papers · research · 1mo ago
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
Abstract: MinT is a managed infrastructure system that enables efficient low-rank adaptation training and serving by keeping base models resident and moving lightweight adapter revisions, scaling across multiple dimensions including large model architectures, reduced storage… 28

llama.cpp releases · dev-tools · 1mo ago
b9141
server, webui: accept continue_final_message flag for vLLM API compat (#23012). Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generation_prompt false,… 11

r/LocalLLaMA · community · 1mo ago
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key… 19

Hugging Face Daily Papers · research · 1mo ago
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
Abstract: ORBIT addresses catastrophic forgetting in large language model fine-tuning for generative retrieval by tracking parameter distances and employing weight averaging to maintain model performance. AI-generated summary: Despite the rapid advancements in large language model… 7

r/LocalLLaMA · community · 1mo ago
qwen3.6 just stops
https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens on opencode. Running with vLLM with… 17

Hugging Face Daily Papers · research · 1mo ago
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Abstract: Pion is a spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers. AI-generated summary: We introduce… 34

Hugging Face Daily Papers · research · 1mo ago
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
Abstract: FaithfulFaces is a pose-faithful facial identity preservation framework that improves identity consistency in text-to-video generation through pose-shared alignment and explicit Euler angle embeddings. AI-generated summary: Identity-preserving text-to-video generation… 38

arXiv — Machine Learning · research · 1mo ago
Rotation-Preserving Supervised Fine-Tuning
arXiv:2605.10973v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight… 22

arXiv — Machine Learning · research · 1mo ago
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
arXiv:2605.11387v1 Announce Type: new Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies… 17

arXiv — NLP / Computation & Language · research · 1mo ago
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
arXiv:2605.11290v1 Announce Type: new Abstract: Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most… 27

arXiv — NLP / Computation & Language · research · 1mo ago
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
arXiv:2605.11317v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every… 33

arXiv — NLP / Computation & Language · research · 1mo ago
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
arXiv:2605.12260v1 Announce Type: new Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the… 8

arXiv — NLP / Computation & Language · research · 1mo ago
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
arXiv:2605.12419v1 Announce Type: new Abstract: Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates… 24

arXiv — NLP / Computation & Language · research · 1mo ago
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
arXiv:2605.11403v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked… 38

Hugging Face Daily Papers · research · 1mo ago
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
Abstract: MemPrivacy enables privacy-preserving personalized memory in edge-cloud environments by using type-aware placeholders to protect sensitive data while maintaining semantic integrity for effective memory operations. AI-generated summary: As LLM-powered agents are… 30

r/LocalLLaMA · community · 1mo ago
Is using vLLM actually worth it if you aren't serving the model to other people?
So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I happen to have an AMD GPU. The… 4

NVIDIA Developer Blog · official-blog · 1mo ago
How to Eliminate Pipeline Friction in AI Model Serving
The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a... 17

r/LocalLLaMA · community · 1mo ago
Needle: We Distilled Gemini Tool Calling Into a 26M Model
We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted… 4

r/LocalLLaMA · community · 1mo ago
New Qwen3.6 27b Autoround Quant (int4) Best Recipe
I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm. I decided to use a similar Autoround recipe but use the "autorund-best" preset instead, it uses more iterations to… 34

r/LocalLLaMA · community · 1mo ago
Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results
Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative Prompts: 880 total, 80 prompts across each of 11 categories Models:… 17

Stratechery (Ben Thompson) · community · 1mo ago
SpaceX and Anthropic, xAI’s Two Companies, Elon Musk and SpaceXAI’s Future
The Anthropic xAI deal is shocking but not surprising: Musk should double down on serving other companies. 25

Hacker News — AI on Front Page · community · 1mo ago
Preserving Fisher-Price Pixter
Article URL: https://dmitry.gr/?r=05.Projects&proj=37.%20Pixter Comments URL: https://news.ycombinator.com/item?id=48091812 Points: 204 # Comments: 43 26

vLLM releases · dev-tools · 1mo ago
v0.20.2
vLLM v0.20.2 Highlights: This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL. Bug Fixes: DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset… 11

vLLM releases · dev-tools · 1mo ago
v0.20.1
vLLM v0.20.1: This is a patch release on top of v0.20.0 primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes. DeepSeek V4: Base model support (#41006). Multi-stream pre-attention GEMM (#41061), configurable… 37

NVIDIA Developer Blog · official-blog · 2mo ago
Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime
Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches... 17

MIT News — AI · research · 2mo ago
Enabling privacy-preserving AI training on everyday devices
A new method could bring more accurate and efficient AI models to high-stakes applications like health care and finance, even in under-resourced settings. 13

Smol AI News · news-outlet · 2mo ago
not much happened today
**vLLM v0.20.0** introduces significant improvements in memory and MoE serving efficiency, including **TurboQuant 2-bit KV cache** for **4× KV capacity** and a **2.1% latency improvement**. The update supports multiple hardware platforms like **DeepSeek V4 MegaMoE on… 9

vLLM releases · dev-tools · 2mo ago
v0.20.0
vLLM v0.20.0 Highlights: This release features 752 commits from 320 contributors (123 new)! DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (… 33

NVIDIA Developer Blog · official-blog · 2mo ago
Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision
As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy... 31

NVIDIA Developer Blog · official-blog · 3mo ago
Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU... 17

NVIDIA Developer Blog · official-blog · 3mo ago
NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak... 14

MIT News — AI · research · 3mo ago
AI system learns to keep warehouse robot traffic running smoothly
This new approach adapts to decide which robots should get the right of way at every moment, avoiding congestion and increasing throughput. 29

NVIDIA Developer Blog · official-blog · 3mo ago
Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads
In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition... 38

NVIDIA Developer Blog · official-blog · 3mo ago
Deploying Disaggregated LLM Inference Workloads on Kubernetes
As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages... 15

Hugging Face · official-blog · 3mo ago
Holotron-12B - High Throughput Computer Use Agent
Team Article. Published March 17, 2026. Pierre-Louis Cedoz (plcedoz38, Hcompany), Hamza Benchekroun (hamza-hcompany, Hcompany), Aurélien Lac (h-aurelien-lac, Hcompany), delfosse (aureliendelfosseathai, Hcompany), Tony Wu… 6

Smol AI News · news-outlet · 3mo ago
not much happened today
**Moonshot's Attention Residuals** paper introduced an input-dependent attention mechanism over prior layers with a **1.25x compute advantage** and less than **2% inference latency overhead**, validated on **Kimi Linear 48B total / 3B active**. The paper sparked debate on… 26

Smol AI News · news-outlet · 3mo ago
not much happened today
**NVIDIA’s Nemotron 3 Super** is a **120B parameter / ~12B active** open model featuring a **hybrid Mamba-Transformer / SSM Latent MoE** architecture and **1M context window**, delivering up to **2.2x faster inference than GPT-OSS-120B** in FP4 with strong throughput gains. It… 10

NVIDIA Developer Blog · official-blog · 3mo ago
Removing the Guesswork from Disaggregated Serving
Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal... 37

MIT News — AI · research · 4mo ago
New method could increase LLM training efficiency
By leveraging idle computing time, researchers can double the speed of model training while preserving accuracy. 13

NVIDIA Developer Blog · official-blog · 4mo ago
Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy
As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as... 25

NVIDIA Developer Blog · official-blog · 4mo ago
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges... 30

Page 7 of 8 · 358 articles