Tag

Inference

356 articles archived under #inference

arXiv — Machine Learning research 8d ago

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

arXiv:2606.23993v1 Announce Type: new Abstract: High-throughput scientific facilities such as the Large Hadron Collider depend on real-time event filtering (triggering) under tight constraints on bandwidth, latency, and storage. In practice, trigger menus are largely…

24
arXiv — Machine Learning research 8d ago

EnerInfer: Energy-Aware On-Device LLM Inference

arXiv:2606.23001v1 Announce Type: cross Abstract: On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding…

13
arXiv — NLP / Computation & Language research 8d ago

A Pāninian Foundation for Indic Language Processing

arXiv:2606.24172v1 Announce Type: new Abstract: More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks…

24
arXiv — Machine Learning research 8d ago

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

arXiv:2606.24506v1 Announce Type: cross Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is…

8
arXiv — NLP / Computation & Language research 8d ago

Qwen-AgentWorld: Language World Models for General Agents

arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can…

8
arXiv — NLP / Computation & Language research 8d ago

Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity

arXiv:2606.24623v1 Announce Type: new Abstract: Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent…

30
arXiv — NLP / Computation & Language research 8d ago

ComputeFHE: A Privacy-Preserving General-Purpose Computation Library

arXiv:2606.24379v1 Announce Type: cross Abstract: Fully Homomorphic Encryption (FHE) enables computations to be performed directly on encrypted data while preserving data confidentiality. However, its practical applications remain limited by high computational costs and…

6
Vercel — AI dev-tools 8d ago

GLM 5.2 Fast via Wafer now available on AI Gateway

GLM 5.2 Fast via Wafer is now available on AI Gateway. Based on our own benchmarking across small-context, large-context, and tool-call scenarios, Wafer delivers 2x higher throughput than other providers serving GLM-5.2 on serverless, leading on decode and end-to-end speed…

7
Hugging Face Daily Papers research 8d ago

Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Abstract Vera is a layered diffusion framework that preserves video content during editing by generating edit layers and alpha mattes through a Mixture-of-Transformers architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video diffusion models have enabled remarkable…

10
r/MachineLearning community 8d ago

What's your biggest pain point when choosing between cloud GPU providers for LLM inference? [R]

Trying to understand how other people make this decision. Do you compare $/hr, $/token, throughput, reliability? Is there a tool or resource you rely on, or are you just doing the math manually? Asking because I'm an ML engineer who's been doing this in spreadsheets and…

14
r/LocalLLaMA community 9d ago

New ablation operator. (apostate)

Today I added a new operator to apostate. This new operator is a contrastive co-vector edit E = I − R Dᵀ . Removing the refusal direction outright disturbs benign behavior, while naively preserving all harmless variance along it leaves the refusal that is entangled with general…

34
r/LocalLLaMA community 10d ago

A100 slow Qwen3.6-27B-FP8

Setting up a server for someone who has an A100 80GB, even though this doesn't natively support FP8 does 43tps decode sound too low for single request? For comparison the exact same vllm config on my RTX 6000 PRO runs the same single request test at 130tps. For 8 concurrent…

11
r/LocalLLaMA community 10d ago

Qwen 27B for planning, Qwen 35B-A3B for execution?

My 32GB unified memory setup runs both, though 27B even with MTP is something like 7-10 tok/sec. Usable but not real time by any means. (~18 tok/sec with 35B-A3B) Would it be worth using 27B to plan long horizon tasks, put together the PLAN.md, and have 35B-A3B iterate over it…

14
r/LocalLLaMA community 10d ago

ROCm vs Vulkan vs vLLM on Dual R9700's

Just wanted to share these numbers I saw running Qwen3.6 35BA3 and Qwen3.6 27B and the big increase I saw going to vLLM. I was just expecting better concurrency but ended up with a lot better speeds. llama.cpp services Running ROCm and Vulkan Model Backend Gen 35B-A3B Q6_K_XL…

19
r/LocalLLaMA community 10d ago

R9700 abysmal performance, getting desperate

I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink I wish I had just gone with nvidia. At this point I'm close to selling the cards. I need vLLM. This is a dedicated setup for multi-user serving. I've tried the…

17
r/LocalLLaMA community 11d ago

I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config.

If you run open-source models and want to understand what's actually happening under the hood — I spent the last few months writing a 15-part series that covers the full stack from tokenization to production serving. Most articles are grounded in Gemma 4 12B as the running…

19
r/MachineLearning community 11d ago

An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]

I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook. Just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and…

13
r/LocalLLaMA community 12d ago

$1800 (in GPU cost running with P2P running Qwen/Qwen3.6-27b-FP8 with 262K context and BF16 KV cache at 55 tok/s

Hey peeps, wanted to share what is possible for folks with an inference only single user use case with 1700 in GPU cost. Setup: 4x 5060 ti (16GB) with P2P If you are in the US and you keep an eye on facebook marketplace and places like slickdeals you can find some 5060 ti 16 GB…

30
Hugging Face Daily Papers research 12d ago

Duration Aware Scheduling for ASR Serving Under Workload Drift

Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by…

26
Hugging Face Daily Papers research 13d ago

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Abstract A unified controllable video world model generates videos from a single image while preserving scene structure and transferring to target weather states through specialized parameterization and conditioning techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video…

22
arXiv — Machine Learning research 13d ago

A Hybrid GNN-FEM Framework for Phase-Field Fracture Simulation. Physics-Preserving Hybridization for Generalizable Surrogate Modeling

arXiv:2606.19378v1 Announce Type: new Abstract: Scientific machine learning (SciML) has emerged as a promising approach for accelerating simulations of complex physical systems, yet achieving physically consistent and generalizable predictions for nonlinear, history-dependent…

28
arXiv — Machine Learning research 13d ago

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

arXiv:2606.19679v1 Announce Type: new Abstract: Lifelong knowledge editing aims to efficiently and sequentially update language models over time, as new knowledge becomes available or when the model makes mistakes, while preserving acceptable performance on past knowledge. One…

31
arXiv — Machine Learning research 13d ago

An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling

arXiv:2606.19770v1 Announce Type: new Abstract: We propose an information-theoretic framework for graph novelty generation, which aims to generate data that are distinct from existing patterns while preserving global structural consistency. Our approach embeds data into a latent…

32
arXiv — Machine Learning research 13d ago

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

arXiv:2606.19993v1 Announce Type: new Abstract: We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix's low-rank approximation with a backward-signal influence metric. Starting from the activation-aware…

38
arXiv — NLP / Computation & Language research 13d ago

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

arXiv:2606.19667v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token…

15
arXiv — NLP / Computation & Language research 13d ago

Closing the Calibration Gap in Semantic Caching

arXiv:2606.19719v1 Announce Type: cross Abstract: Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether…

26
arXiv — NLP / Computation & Language research 13d ago

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

arXiv:2606.19808v1 Announce Type: cross Abstract: Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes.…

25
r/LocalLLaMA community 13d ago

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be ok. I got my 4th 3090 and then unsloth dropped the Q2 and Q1. I wrote nothing else here its from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs across 4×3090 + RAM expert offload…

7
r/LocalLLaMA community 13d ago

DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...

Figured I'd post up a bit of info for anyone else who was thinking about messing with this model on a 3090/4090. Obviously I can't use the nvfp4, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. Had to run it in a custom vLLM docker they provide for…

34
r/MachineLearning community 13d ago

Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R]

I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified…

29
r/LocalLLaMA community 13d ago

NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

The best i can get from Qwen3.6-27B on my 32GB VRAM (2 x 5060) is ~60 tok/sec gen speed at context size 196608. (sakamakismile text nvfp4). Fp8 kv quantization. NVFP4 kv cache quantization can’t get here fast enough. Reminds me of the time there was this game i couldn’t play on…

38
arXiv — Machine Learning research 14d ago

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

arXiv:2606.18309v1 Announce Type: new Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that…

24
arXiv — Machine Learning research 14d ago

SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

arXiv:2606.18384v1 Announce Type: new Abstract: Hierarchical Federated Learning (HFL) enables scalable collaborative model training across distributed devices while preserving data privacy. However, existing HFL client selection mechanisms suffer from a fundamental strategic…

31
arXiv — Machine Learning research 14d ago

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

arXiv:2606.18431v1 Announce Type: new Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such…

13
arXiv — Machine Learning research 14d ago

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

arXiv:2606.18518v1 Announce Type: new Abstract: The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution,…

4
arXiv — Machine Learning research 14d ago

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,…

18
arXiv — Machine Learning research 14d ago

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

arXiv:2606.18627v1 Announce Type: new Abstract: Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task…

30
arXiv — NLP / Computation & Language research 14d ago

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

arXiv:2606.18473v1 Announce Type: new Abstract: Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often…

15
r/LocalLLaMA community 14d ago

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

Before Fable 5 was shutdown, it helped us optimize our Gemma 4 WebGPU kernels, reaching around 255 tokens per second on my M4 Max. Today, we're releasing the demo and kernels for you to try out yourself. Hope you find it interesting! Links: - Demo (+ kernels):…

9
arXiv — Machine Learning research 15d ago

Performance-Driven Environment Abstraction with Multi-Timescale Learning

arXiv:2606.17377v1 Announce Type: new Abstract: We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We…

8
Hugging Face Daily Papers research 16d ago

Memento: Reconstruct to Remember for Consistent Long Video Generation

Abstract Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-form video generation requires…

17
Hugging Face Daily Papers research 16d ago

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Abstract Multi-turn large language model serving faces memory constraints due to growing key-value cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management. Generated…

14
arXiv — Machine Learning research 16d ago

Repeated Bilateral Trade: The Quest for Fairness

arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the…

34
arXiv — Machine Learning research 16d ago

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

arXiv:2606.15730v1 Announce Type: new Abstract: Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common…

35
arXiv — NLP / Computation & Language research 16d ago

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

arXiv:2606.15266v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains…

16
arXiv — NLP / Computation & Language research 16d ago

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

arXiv:2606.15333v1 Announce Type: new Abstract: LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as…

5
arXiv — NLP / Computation & Language research 16d ago

Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

arXiv:2606.15335v1 Announce Type: new Abstract: When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and…

17
arXiv — NLP / Computation & Language research 16d ago

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are…

21
Hugging Face Daily Papers research 16d ago

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Abstract Nemotron 3 Ultra is a large-scale language model featuring hybrid Mamba-Attention architecture with 550 billion parameters, achieving high inference throughput and extended context length through specialized training techniques. Generated by…

5
Hugging Face Daily Papers research 16d ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as…

32
