Tag

Inference

356 articles archived under #inference · RSS

arXiv — Machine Learning research 2h ago

Learning Generalizable Skill Policy with Data-Efficient Unsupervised RL

arXiv:2607.00392v1 Announce Type: new Abstract: Unsupervised Reinforcement Learning (URL) aims to pre-train scalable, skill-conditioned policies without extrinsic rewards, serving as a foundation for downstream control tasks. Despite recent progress, we argue that current…

34
arXiv — Machine Learning research 2h ago

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

arXiv:2607.00760v1 Announce Type: new Abstract: Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cache grows linearly with context length, it can exhaust GPU…

9
arXiv — Machine Learning research 2h ago

Staleness-Learning Rate Scaling Laws for Asynchronous RLHF

arXiv:2607.01083v1 Announce Type: new Abstract: High-throughput RLHF systems often decouple rollout generation from policy optimization, leading to the use of stale rollouts during learner updates. In this work, we study the effect of such staleness in asynchronous GRPO. We make…

23
arXiv — NLP / Computation & Language research 2h ago

BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal

arXiv:2607.00501v1 Announce Type: new Abstract: We present BaseRT, a native Metal inference runtime for large language models (LLMs) on Apple Silicon, and report the highest inference throughput on this hardware to date. Existing runtimes, including llama.cpp and MLX-based…

22
arXiv — NLP / Computation & Language research 2h ago

OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

arXiv:2510.24636v3 Announce Type: replace Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and…

34
r/LocalLLaMA community 5h ago

I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3

Just wanted to share that I was looking for an optimal way to run Ornith 35B in FP8 with E4M3 and MTP with vLLM but there was no out-of-the-box model with MTP drafter support. So I grafted this new model! It's 18% faster than without MTP and the drafter acceptance rate is not…

31
r/LocalLLaMA community 13h ago

How to improve RAM offload?

I have only 12GB VRAM (RTX3060) but have enough RAM to run Qwen3.6 27B Q4 with offload. Something tells me that it won't achieve maximum performance, but why is DRAM speed only around 30GB/s (HWiNFO data) during inference with dual channel 5200 RAM? TG is 3.12 tok/sec with 18K…

38
r/LocalLLaMA community 19h ago

Thinking about grabbing 4x Ascend GX10s

Some in this sub have tested GLM5.2 on 4x DGX Sparks (or Ascend GX10) with 400-500 tok/s prompt processing and ~15 tok/s output at 128k context. Not blazing fast, but usable imo, especially with quantization. My thinking: If there's an open-source fable 5 sometime in december or…

20
arXiv — Machine Learning research 1d ago

Fora: From Weight-Space to Function-Space Protection in Capability-Preserving Fine-Tuning

arXiv:2606.31092v1 Announce Type: new Abstract: Full fine-tuning adapts large language models to new tasks but can erode capabilities they already possess. Existing remedies protect through proxies such as parameter distances, importance penalties, output matching, or dominant…

11
arXiv — Machine Learning research 1d ago

TDGT: A Tabular Data Generation Toolkit supporting adaptive GPU-accelerated Bayesian mixture models, diffusion-based models, and latent-space generative modeling

arXiv:2606.31268v1 Announce Type: new Abstract: The growing demand for privacy-preserving data sharing has positioned synthetic data generation as a critical component of responsible AI workflows. Despite notable advances in generative modeling, existing solutions often lack…

29
arXiv — Machine Learning research 1d ago

Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR

arXiv:2606.31813v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) and its variants enable parameter-efficient fine-tuning of large language models under the supervised fine-tuning (SFT) paradigm. However, their efficacy and behavior under Reinforcement learning with…

24
arXiv — Machine Learning research 1d ago

Criticality-Constrained Iterative Pruning for Energy-Efficient Spiking Neural Networks via Combined Importance Scoring

arXiv:2606.30676v1 Announce Type: cross Abstract: Deploying spiking neural networks (SNNs) on neuromorphic hardware demands aggressive synaptic pruning while preserving temporal computation integrity. Existing strategies either neglect neuronal criticality or rely on convex…

5
arXiv — NLP / Computation & Language research 1d ago

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

arXiv:2606.31128v1 Announce Type: cross Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion…

30
Hugging Face Daily Papers research 1d ago

One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Abstract: InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost. Generated by…

16
r/LocalLLaMA community 1d ago

Devs - you have 64gb of VRAM - which model do you use for coding?

I've currently settled on an unsloth version of Qwen 3.5 122b-a10b model (UD-IQ4_NL). With 100k bf16 context window, I only had to load a few layers into CPU/RAM, it runs around 30 tok/sec which is fine for me. I've tested many models, hours of testing but I am currently deeply…

32
r/LocalLLaMA community 1d ago

Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset

Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between…

4
arXiv — Machine Learning research 2d ago

Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter

arXiv:2606.28441v1 Announce Type: new Abstract: Online latent state estimation constitutes a fundamental challenge within the artificial intelligence field, serving as a foundational tool for diverse applications, including sequential decision making, anomaly and change-point…

21
arXiv — Machine Learning research 2d ago

DiLaServe: High SLO Attainment Serving for Diffusion Language Models

arXiv:2606.29094v1 Announce Type: new Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to conventional autoregressive language models. By generating multiple tokens in parallel during each denoising step, they offer higher inference…

36
arXiv — Machine Learning research 2d ago

Prototype Latent World Model Replay for Class-Incremental Learning

arXiv:2606.29465v1 Announce Type: new Abstract: Class-incremental learning requires a model to learn new classes while preserving decision regions for old ones. This is difficult when raw old samples are no longer available. We propose Prototype Latent World Model Replay, a…

8
arXiv — NLP / Computation & Language research 2d ago

Structure-Preserving Document Translation via Multi-Stage LLM Pipeline: A Case Study in Marathi

arXiv:2606.28796v1 Announce Type: new Abstract: Government documents in India are predominantly issued in regional languages such as Marathi, creating substantial accessibility barriers for non-native readers, interstate administrative bodies, and policy analysts. Although…

30
arXiv — NLP / Computation & Language research 2d ago

A Comparative Study on Affective Cues in Text Embeddings Across Psychological Emotion Theories

arXiv:2606.29068v1 Announce Type: new Abstract: Text encoders are known for their utility in natural language processing, as they are able to efficiently compress inputs into dense vectors while preserving semantics. These models have been applied to affective computing, in…

19
r/LocalLLaMA community 2d ago

What's the full local AI "doomsday prepper" kit for cold storage? 16-bit safetensors of LLMs (obv), copies/source codes of Llama.cpp, ComfyUI, vLLM, Kobold, LMStudio, etc, macOS, Linux OSes, Windows 10&11, etc, Rufus (including older ones), various VMs, P-E-W's Heretic/Grimoire,…

For those who want to be as paranoid and maximally doomsday prepped as possible, I am curious what the most thorough "doomsday kit" is of things to store offline copies of "just in case", to still be able to use local AI if things go truly crazy to a super extreme level. So far…

23
r/MachineLearning community 2d ago

Cerebras OpenAI deal capacity has effectively killed the waitlist for everyone else [D]

I’m pretty annoyed. We’re a small AI startup building a real-time coding agent. Our p95 latency requirements are tight (and self-imposed, but that’s the product). We need sustained high-throughput inference with ~1-2k tokens/second. Been on the Cerebras waitlist for months trying…

25
arXiv — Machine Learning research 3d ago

NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning

arXiv:2606.27771v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of…

8
arXiv — Machine Learning research 3d ago

Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets…

21
arXiv — Machine Learning research 3d ago

Odyssey: Constructing Verifiable Local Truth-Preserving Foundation Models

arXiv:2606.27593v1 Announce Type: cross Abstract: We introduce a categorical framework called ODYSSEY for constructing verifiable, local truth-preserving foundation models as compositions of foundries: building-block architectural components that specify a cover of local…

27
arXiv — NLP / Computation & Language research 3d ago

Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving

arXiv:2606.27457v1 Announce Type: cross Abstract: Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones.…

20
arXiv — NLP / Computation & Language research 3d ago

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

arXiv:2605.06675v2 Announce Type: replace-cross Abstract: Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache…

5
r/LocalLLaMA community 3d ago

High-quality GLM-5.2 Quant on 4x DGX Spark - Guide, Results, and Comps

I got GLM-5.2 NVFP4 running on four DGX Sparks at 128K context. This is still a niche/hacky setup, but it is now a real serving point rather than just a proof of life. Objective : A high quality 4-bit quant running on 4x spark. Model: https://huggingface.co/Mapika/GLM-5.2-NVFP4…

9
r/LocalLLaMA community 3d ago

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

Follow-up to my previous Ornith-1.0-35B Q3_K_M post. I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp: 1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s). Next-token distribution is byte-identical to…

11
Hacker News — AI on Front Page community 4d ago

AMD Strix Halo RDMA Cluster Setup Guide

Article URL: https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md Comments URL: https://news.ycombinator.com/item?id=48703258 Points: 207 # Comments: 61

22
arXiv — Machine Learning research 6d ago

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels…

20
arXiv — Machine Learning research 6d ago

Quantization in Federated Learning: Methods, Challenges and Future Directions

arXiv:2606.26822v1 Announce Type: new Abstract: Federated Learning (FL) has become a foundational paradigm for privacy-preserving distributed intelligence, yet its scalability remains fundamentally constrained by communication bottlenecks, device heterogeneity, and the…

20
arXiv — NLP / Computation & Language research 6d ago

AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification

arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but…

31
arXiv — NLP / Computation & Language research 6d ago

Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline

arXiv:2606.27347v1 Announce Type: new Abstract: Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and…

11
Hugging Face official-blog 6d ago

Run a vLLM Server on HF Jobs in One Command

Published June 26, 2026, by Quentin Gallouédec (qgallouedec). You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command, no servers…

18
r/LocalLLaMA community 6d ago

LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels

Everything runs locally in your browser using custom WebGPU kernels written by Fable 5 (before it was shut down) and Opus 4.8. The video was recorded on my M4 Max. Model: LiquidAI/LFM2.5-230M ( GGUF ) Demo: https://huggingface.co/spaces/webml-community/lfm2-webgpu-kernels  …

37
NVIDIA Developer Blog official-blog 6d ago

Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support

Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the...

38
Vercel — AI dev-tools 6d ago

AI SDK 7 is now available

AI SDK 7 is a major release for building production agents in TypeScript. The SDK has grown from model calls and chat primitives into a broader agent platform for developing, running, integrating, and observing agents across text, audio, realtime, image, and video. Every major…

8
Smol AI News news-outlet 7d ago

not much happened today

**Z.ai's GLM-5.2** leads in coding and agent benchmarks with top scores like **1595** on Code Arena: Frontend and **34.29%** reasoning accuracy with zero failures. Databricks improved GLM-5.2 speed to **392 tok/s** using hardware and optimizations. **Ornith-1.0**, a new…

13
arXiv — Machine Learning research 7d ago

TL++: Accuracy and Privacy Preserving Traversal Learning for Distributed Intelligent Systems

arXiv:2606.25627v1 Announce Type: new Abstract: Distributed intelligent systems increasingly need to train across data silos without centralizing raw data. Federated learning keeps data local but can suffer under heterogeneous partitions and requires repeated full-model…

22
arXiv — NLP / Computation & Language research 7d ago

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

arXiv:2606.24957v1 Announce Type: new Abstract: While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency.…

19
Hugging Face Daily Papers research 7d ago

RoPE-Aware Bit Allocation for KV-Cache Quantization

Abstract: Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Existing low-bit…

22
r/LocalLLaMA community 7d ago

Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing

TL;DR: the recipe's image-build mods aren't actually public – I reconstructed them from the public kernels (with Claude) – and you have to build vLLM at the author's exact pinned ref or the real AWQ weights crash on load. Running now at ~9.4 tok/s on my own 4× GB10. Saw a link…

20
r/LocalLLaMA community 7d ago

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

I'm wondering if anyone else has come across this. I've tested the same model on llama.cpp and vLLM with similar settings and quantizations. The performance and concurrency in vLLM are noticeably better, but sometimes the model feels less reliable. Some things I've noticed:…

27
r/LocalLLaMA community 7d ago

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

G'day. This is part 3 on my Local LLM adventures. I have a crazy hacked server-to-desktop system. GPUs: 2x Hopper H100, 96 GB HBM3 each. CPUs: 2x Grace, 72 cores each. Host memory: 480 GB LPDDR5X per Grace, 960 GB total. So I can technically run GLM5.2.…

34
r/LocalLLaMA community 7d ago

Qwen3.6 27B more dumb in vLLM compared to llama.cpp

Hello, I recently bought a new RTX 5060Ti to pair with the RTX 5060Ti I already own, now I have 32GB of VRAM. Up until now for convenience I've used llama.cpp, for goodness' sake it works excellently when only 1 user is using it, but now there are 2 of us using it and llama.cpp…

34
r/LocalLLaMA community 8d ago

Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT

Full-document parsing instead of cropped-region OCR 32K output length for long OCR sequences Base and gundam image modes for different document layouts Transformers inference + SGLang serving with OpenAI-compatible streaming requests Built to push DeepSeek-OCR-style document…

22
arXiv — Machine Learning research 8d ago

ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation

arXiv:2606.23898v1 Announce Type: new Abstract: Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. Unlike recognition tasks, knowledge distillation in conditional…

14
arXiv — Machine Learning research 8d ago

Offline Reinforcement Learning for Warehouse SLAM Throughput Control

arXiv:2606.23978v1 Announce Type: new Abstract: We present an offline reinforcement learning (RL) framework for optimizing SLAM throughput control in a warehouse fulfillment environment. SLAM (Scan/Label/Apply/Manifest) throughput directly influences system congestion and…

18
