Tag

Inference

358 articles archived under #inference · RSS

arXiv — Machine Learning research 1mo ago

MoRe: Modular Representations for Principled Continual Representation Learning on Squantial Data

arXiv:2605.14364v1 Announce Type: new Abstract: Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal…

13
arXiv — NLP / Computation & Language research 1mo ago

GradShield: Alignment Preserving Finetuning

arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a…

23
Hugging Face Daily Papers research 1mo ago

Topology-Preserving Neural Operator Learning via Hodge Decomposition

Abstract Physical field equations on geometric meshes are analyzed through Hodge theory to develop a hybrid Eulerian-Lagrangian architecture that improves accuracy and efficiency by separating topological and geometric components. AI-generated summary In this paper, we study…

29
Vercel — AI dev-tools 1mo ago

Sort providers by cost, latency, or throughput on AI Gateway

You can now sort the providers behind a model by cost, time to first token (TTFT), or throughput (TPS) in AI Gateway . The default provider order blends provider reliability, quality of model output, cost, and speed of response. You can now use sort for explicit control over…

35
vLLM releases dev-tools 1mo ago

v0.21.0

Highlights This release features 367 commits from 202 contributors (49 new)! Transformers v4 deprecated : This release formally deprecates transformers v4 support ( #40389 ). Users should migrate to transformers v5. C++20 build requirement : vLLM now requires a C++20-compatible…

23
r/LocalLLaMA community 1mo ago

A First Comprehensive Study of TurboQuant: Accuracy and Performance

TL;DR from the article: FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving…

27
r/LocalLLaMA community 1mo ago

Is there a big gap between Q4 and Q6 on Qwen3.6?

I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high. Maybe 65k or up to 100k. I’ve thrown around the idea of a second 3090. But I do already have some…

28
arXiv — Machine Learning research 1mo ago

Inference-Time Machine Unlearning via Gated Activation Redirection

arXiv:2605.12765v1 Announce Type: new Abstract: Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model…

10
arXiv — Machine Learning research 1mo ago

Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

arXiv:2605.13021v1 Announce Type: new Abstract: Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most…

28
Hugging Face Daily Papers research 1mo ago

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Abstract MinT is a managed infrastructure system that enables efficient low-rank adaptation training and serving by keeping base models resident and moving lightweight adapter revisions, scaling across multiple dimensions including large model architectures, reduced storage…

28
llama.cpp releases dev-tools 1mo ago

b9141

server, webui: accept continue_final_message flag for vLLM API compat ( #23012 ) server, webui: accept continue_final_message flag for vLLM API compat Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generation_prompt false,…

11
r/LocalLLaMA community 1mo ago

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key…

19
Hugging Face Daily Papers research 1mo ago

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Abstract ORBIT addresses catastrophic forgetting in large language model fine-tuning for generative retrieval by tracking parameter distances and employing weight averaging to maintain model performance. AI-generated summary Despite the rapid advancements in large language model…

7
r/LocalLLaMA community 1mo ago

qwen3.6 just stops

https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens on opencode. Running with vLLM with…

17
Hugging Face Daily Papers research 1mo ago

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Abstract Pion is a spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers. AI-generated summary We introduce…

34
Hugging Face Daily Papers research 1mo ago

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Abstract FaithfulFaces is a pose-faithful facial identity preservation framework that improves identity consistency in text-to-video generation through pose-shared alignment and explicit Euler angle embeddings. AI-generated summary Identity-preserving text-to-video generation…

38
arXiv — Machine Learning research 1mo ago

Rotation-Preserving Supervised Fine-Tuning

arXiv:2605.10973v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight…

22
arXiv — Machine Learning research 1mo ago

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

arXiv:2605.11387v1 Announce Type: new Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies…

17
arXiv — NLP / Computation & Language research 1mo ago

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

arXiv:2605.11290v1 Announce Type: new Abstract: Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most…

27
arXiv — NLP / Computation & Language research 1mo ago

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

arXiv:2605.11317v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every…

33
arXiv — NLP / Computation & Language research 1mo ago

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

arXiv:2605.12260v1 Announce Type: new Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the…

8
arXiv — NLP / Computation & Language research 1mo ago

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

arXiv:2605.12419v1 Announce Type: new Abstract: Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates…

24
arXiv — NLP / Computation & Language research 1mo ago

fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum

arXiv:2605.11403v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked…

38
Hugging Face Daily Papers research 1mo ago

MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

Abstract MemPrivacy enables privacy-preserving personalized memory in edge-cloud environments by using type-aware placeholders to protect sensitive data while maintaining semantic integrity for effective memory operations. AI-generated summary As LLM-powered agents are…

30
r/LocalLLaMA community 1mo ago

Is using vLLM actually worth it if you aren't serving the model to other people?

So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I happen to have an AMD GPU. The…

4
NVIDIA Developer Blog official-blog 1mo ago

How to Eliminate Pipeline Friction in AI Model Serving

The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...

17
r/LocalLLaMA community 1mo ago

Needle: We Distilled Gemini Tool Calling Into a 26M Model

We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted…

4
r/LocalLLaMA community 1mo ago

New Qwen3.6 27b Autoround Quant (int4) Best Recipe

I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm. I decided to use a similar Autoround recipe but use the "autorund-best" preset instead, it uses more iterations to…

34
r/LocalLLaMA community 1mo ago

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative Prompts: 880 total, 80 prompts across each of 11 categories Models:…

17
Stratechery (Ben Thompson) community 1mo ago

SpaceX and Anthropic, xAI’s Two Companies, Elon Musk and SpaceXAI’s Future

The Anthropic xAI deal is shocking but not surprising: Musk should double down on serving other companies.

25
Hacker News — AI on Front Page community 1mo ago

Preserving Fisher-Price Pixter

Article URL: https://dmitry.gr/?r=05.Projects&proj=37.%20Pixter Comments URL: https://news.ycombinator.com/item?id=48091812 Points: 204 # Comments: 43

26
vLLM releases dev-tools 1mo ago

v0.20.2

vLLM v0.20.2 Highlights This release features 6 commits from 6 contributors (0 new)! This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL Bug Fixes DeepSeek V4 sparse attention : Re-enable the persistent topk path on Hopper and ensure the memset…

11
vLLM releases dev-tools 1mo ago

v0.20.1

vLLM v0.20.1 This is a patch release on top of v0.20.0 primarily focused on DeepSeek V4 stabilization and performance improvements , along with several important bug fixes. DeepSeek V4 Base model support ( #41006 ). Multi-stream pre-attention GEMM ( #41061 ), configurable…

37
NVIDIA Developer Blog official-blog 2mo ago

Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime

Neural network techniques are increasingly used in computer graphics to boost image quality, improve performance, and streamline content creation. Approaches...

17
MIT News — AI research 2mo ago

Enabling privacy-preserving AI training on everyday devices

A new method could bring more accurate and efficient AI models to high-stakes applications like health care and finance, even in under-resourced settings.

13
Smol AI News news-outlet 2mo ago

not much happened today

**vLLM v0.20.0** introduces significant improvements in memory and MoE serving efficiency, including **TurboQuant 2-bit KV cache** for **4× KV capacity** and a **2.1% latency improvement**. The update supports multiple hardware platforms like **DeepSeek V4 MegaMoE on…

9
vLLM releases dev-tools 2mo ago

v0.20.0

vLLM v0.20.0 Highlights This release features 752 commits from 320 contributors (123 new)! DeepSeek V4 : Initial DeepSeek V4 support landed ( #40860 ), with DSML token-leakage fix in DSV4/3.2 ( #40806 ), DSA + MTP IMA fix ( #40772 ), and a silu clamp limit on the shared expert (…

33
NVIDIA Developer Blog official-blog 2mo ago

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

As LLMs transition from simple text generation to complex reasoning, reinforcement learning (RL) plays a central role. Algorithms like Group Relative Policy...

31
NVIDIA Developer Blog official-blog 3mo ago

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

17
NVIDIA Developer Blog official-blog 3mo ago

NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design

Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak...

14
MIT News — AI research 3mo ago

AI system learns to keep warehouse robot traffic running smoothly

This new approach adapts to decide which robots should get the right of way at every moment, avoiding congestion and increasing throughput.

29
NVIDIA Developer Blog official-blog 3mo ago

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

In production Kubernetes environments, the difference between model requirements and GPU size creates inefficiencies. Lightweight automatic speech recognition...

38
NVIDIA Developer Blog official-blog 3mo ago

Deploying Disaggregated LLM Inference Workloads on Kubernetes

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages...

15
Hugging Face official-blog 3mo ago

Holotron-12B - High Throughput Computer Use Agent

Back to Articles Holotron-12B - High Throughput Computer Use Agent Team Article Published March 17, 2026 Upvote 22 Pierre-Louis Cedoz plcedoz38 Hcompany Hamza Benchekroun hamza-hcompany Hcompany Aurélien Lac h-aurelien-lac Hcompany delfosse aureliendelfosseathai Hcompany Tony Wu…

6
Smol AI News news-outlet 3mo ago

not much happened today

**Moonshot's Attention Residuals** paper introduced an input-dependent attention mechanism over prior layers with a **1.25x compute advantage** and less than **2% inference latency overhead**, validated on **Kimi Linear 48B total / 3B active**. The paper sparked debate on…

26
Smol AI News news-outlet 3mo ago

not much happened today

**NVIDIA’s Nemotron 3 Super** is a **120B parameter / ~12B active** open model featuring a **hybrid Mamba-Transformer / SSM Latent MoE** architecture and **1M context window**, delivering up to **2.2x faster inference than GPT-OSS-120B** in FP4 with strong throughput gains. It…

10
NVIDIA Developer Blog official-blog 3mo ago

Removing the Guesswork from Disaggregated Serving

Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal...

37
MIT News — AI research 4mo ago

New method could increase LLM training efficiency

By leveraging idle computing time, researchers can double the speed of model training while preserving accuracy.

13
NVIDIA Developer Blog official-blog 4mo ago

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as...

25
NVIDIA Developer Blog official-blog 4mo ago

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges...

30

MoRe: Modular Representations for Principled Continual Representation Learning on Squantial Data

GradShield: Alignment Preserving Finetuning

Topology-Preserving Neural Operator Learning via Hodge Decomposition

Sort providers by cost, latency, or throughput on AI Gateway

v0.21.0

A First Comprehensive Study of TurboQuant: Accuracy and Performance

Is there a big gap between Q4 and Q6 on Qwen3.6?

Inference-Time Machine Unlearning via Gated Activation Redirection

Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

b9141

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

qwen3.6 just stops

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Rotation-Preserving Supervised Fine-Tuning

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum

MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

Is using vLLM actually worth it if you aren't serving the model to other people?

How to Eliminate Pipeline Friction in AI Model Serving

Needle: We Distilled Gemini Tool Calling Into a 26M Model

New Qwen3.6 27b Autoround Quant (int4) Best Recipe

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

SpaceX and Anthropic, xAI&#8217;s Two Companies, Elon Musk and SpaceXAI&#8217;s Future

Preserving Fisher-Price Pixter

v0.20.2

v0.20.1

Speed Up Unreal Engine NNE Inference with NVIDIA TensorRT for RTX Runtime

Enabling privacy-preserving AI training on everyday devices

not much happened today

v0.20.0

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design

AI system learns to keep warehouse robot traffic running smoothly

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

Deploying Disaggregated LLM Inference Workloads on Kubernetes

Holotron-12B - High Throughput Computer Use Agent

not much happened today

not much happened today

Removing the Guesswork from Disaggregated Serving

New method could increase LLM training efficiency

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

SpaceX and Anthropic, xAI’s Two Companies, Elon Musk and SpaceXAI’s Future