Tag: Inference · 356 articles archived under #inference
arXiv — Machine Learning research 8d ago Learning to Trigger: Reinforcement Learning at the Large Hadron Collider arXiv:2606.23993v1 Announce Type: new Abstract: High-throughput scientific facilities such as the Large Hadron Collider depend on real-time event filtering (triggering) under tight constraints on bandwidth, latency, and storage. In practice, trigger menus are largely… 24 arXiv — Machine Learning research 8d ago EnerInfer: Energy-Aware On-Device LLM Inference arXiv:2606.23001v1 Announce Type: cross Abstract: On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding… 13 arXiv — NLP / Computation & Language research 8d ago A Pāninian Foundation for Indic Language Processing arXiv:2606.24172v1 Announce Type: new Abstract: More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks… 24 arXiv — Machine Learning research 8d ago CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation arXiv:2606.24506v1 Announce Type: cross Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is… 8 arXiv — NLP / Computation & Language research 8d ago Qwen-AgentWorld: Language World Models for General Agents arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. 
In this work, we investigate how world modeling based on language models can… 8 arXiv — NLP / Computation & Language research 8d ago Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity arXiv:2606.24623v1 Announce Type: new Abstract: Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent… 30 arXiv — NLP / Computation & Language research 8d ago ComputeFHE: A Privacy-Preserving General-Purpose Computation Library arXiv:2606.24379v1 Announce Type: cross Abstract: Fully Homomorphic Encryption (FHE) enables computations to be performed directly on encrypted data while preserving data confidentiality. However, its practical applications remain limited by high computational costs and… 6 Vercel — AI dev-tools 8d ago GLM 5.2 Fast via Wafer now available on AI Gateway GLM 5.2 Fast via Wafer is now available on AI Gateway. Based on our own benchmarking across small-context, large-context, and tool-call scenarios, Wafer delivers 2x higher throughput than other providers serving GLM-5.2 on serverless, leading on decode and end-to-end speed… 7 Hugging Face Daily Papers research 8d ago Vera: A Layered Diffusion Model for Content-Preserving Video Editing Abstract Vera is a layered diffusion framework that preserves video content during editing by generating edit layers and alpha mattes through a Mixture-of-Transformers architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video diffusion models have enabled remarkable… 10 r/MachineLearning community 8d ago What's your biggest pain point when choosing between cloud GPU providers for LLM inference? [R] Trying to understand how other people make this decision. Do you compare $/hr, $/token, throughput, reliability? 
Is there a tool or resource you rely on, or are you just doing the math manually? Asking because I'm an ML engineer who's been doing this in spreadsheets and… 14 r/LocalLLaMA community 9d ago New ablation operator. (apostate) Today I added a new operator to apostate. This new operator is a contrastive co-vector edit E = I − R Dᵀ . Removing the refusal direction outright disturbs benign behavior, while naively preserving all harmless variance along it leaves the refusal that is entangled with general… 34 r/LocalLLaMA community 10d ago A100 slow Qwen3.6-27B-FP8 Setting up a server for someone who has an A100 80GB, even though this doesn't natively support FP8, does 43tps decode sound too low for a single request? For comparison the exact same vllm config on my RTX 6000 PRO runs the same single request test at 130tps. For 8 concurrent… 11 r/LocalLLaMA community 10d ago Qwen 27B for planning, Qwen 35B-A3B for execution? My 32GB unified memory setup runs both, though 27B even with MTP is something like 7-10 tok/sec. Usable but not real time by any means. (~18 tok/sec with 35B-A3B) Would it be worth using 27B to plan long horizon tasks, put together the PLAN.md, and have 35B-A3B iterate over it… 14 r/LocalLLaMA community 10d ago ROCm vs Vulkan vs vLLM on Dual R9700's Just wanted to share these numbers I saw running Qwen3.6 35BA3 and Qwen3.6 27B and the big increase I saw going to vLLM. I was just expecting better concurrency but ended up with a lot better speeds. llama.cpp services Running ROCm and Vulkan Model Backend Gen 35B-A3B Q6_K_XL… 19 r/LocalLLaMA community 10d ago R9700 abysmal performance, getting desperate I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink I wish I had just gone with nvidia. At this point I'm close to selling the cards. I need vLLM. This is a dedicated setup for multi-user serving. 
I've tried the… 17 r/LocalLLaMA community 11d ago I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config. If you run open-source models and want to understand what's actually happening under the hood — I spent the last few months writing a 15-part series that covers the full stack from tokenization to production serving. Most articles are grounded in Gemma 4 12B as the running… 19 r/MachineLearning community 11d ago An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P] I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook. Just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and… 13 r/LocalLLaMA community 12d ago $1800 (in GPU cost running with P2P running Qwen/Qwen3.6-27b-FP8 with 262K context and BF16 KV cache at 55 tok/s Hey peeps, wanted to share what is possible for folks with an inference only single user use case with 1700 in GPU cost. Setup: 4x 5060 ti (16GB) with P2P If you are in the US and you keep an eye on facebook marketplace and places like slickdeals you can find some 5060 ti 16 GB… 30 Hugging Face Daily Papers research 12d ago Duration Aware Scheduling for ASR Serving Under Workload Drift Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. 
Generated by… 26 Hugging Face Daily Papers research 13d ago Holo-World: Unified Camera, Object and Weather Control for Video World Model Abstract A unified controllable video world model generates videos from a single image while preserving scene structure and transferring to target weather states through specialized parameterization and conditioning techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video… 22 arXiv — Machine Learning research 13d ago A Hybrid GNN-FEM Framework for Phase-Field Fracture Simulation. Physics-Preserving Hybridization for Generalizable Surrogate Modeling arXiv:2606.19378v1 Announce Type: new Abstract: Scientific machine learning (SciML) has emerged as a promising approach for accelerating simulations of complex physical systems, yet achieving physically consistent and generalizable predictions for nonlinear, history-dependent… 28 arXiv — Machine Learning research 13d ago LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing arXiv:2606.19679v1 Announce Type: new Abstract: Lifelong knowledge editing aims to efficiently and sequentially update language models over time, as new knowledge becomes available or when the model makes mistakes, while preserving acceptable performance on past knowledge. One… 31 arXiv — Machine Learning research 13d ago An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling arXiv:2606.19770v1 Announce Type: new Abstract: We propose an information-theoretic framework for graph novelty generation, which aims to generate data that are distinct from existing patterns while preserving global structural consistency. 
Our approach embeds data into a latent… 32 arXiv — Machine Learning research 13d ago Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs arXiv:2606.19993v1 Announce Type: new Abstract: We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix's low-rank approximation with a backward-signal influence metric. Starting from the activation-aware… 38 arXiv — NLP / Computation & Language research 13d ago CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference arXiv:2606.19667v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token… 15 arXiv — NLP / Computation & Language research 13d ago Closing the Calibration Gap in Semantic Caching arXiv:2606.19719v1 Announce Type: cross Abstract: Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether… 26 arXiv — NLP / Computation & Language research 13d ago Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning arXiv:2606.19808v1 Announce Type: cross Abstract: Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes.… 25 r/LocalLLaMA community 13d ago GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster TLDR: For the first time, I feel relief that they could shut down the cloud services and I would be ok. I got my 4th 3090 and then unsloth dropped the Q2 and Q1. 
I wrote nothing else here; it's from CC, so it might be wrong. GLM-5.2 UD-IQ2_M runs across 4×3090 + RAM expert offload… 7 r/LocalLLaMA community 13d ago DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts... Figured I'd post up a bit of info for anyone else who was thinking about messing with this model on a 3090/4090. Obviously I can't use the nvfp4, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. Had to run it in a custom vLLM docker they provide for… 34 r/MachineLearning community 13d ago Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R] I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified… 29 r/LocalLLaMA community 13d ago NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable The best I can get from Qwen3.6-27B on my 32GB VRAM (2 x 5060) is ~60 tok/sec gen speed at context size 196608. (sakamakismile text nvfp4). Fp8 kv quantization. NVFP4 kv cache quantization can’t get here fast enough. Reminds me of the time there was this game I couldn’t play on… 38 arXiv — Machine Learning research 14d ago SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector arXiv:2606.18309v1 Announce Type: new Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. 
We have found that… 24 arXiv — Machine Learning research 14d ago SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System arXiv:2606.18384v1 Announce Type: new Abstract: Hierarchical Federated Learning (HFL) enables scalable collaborative model training across distributed devices while preserving data privacy. However, existing HFL client selection mechanisms suffer from a fundamental strategic… 31 arXiv — Machine Learning research 14d ago Beyond Prediction: Tail-Aware Scheduling for LLM Inference arXiv:2606.18431v1 Announce Type: new Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such… 13 arXiv — Machine Learning research 14d ago PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization arXiv:2606.18518v1 Announce Type: new Abstract: The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution,… 4 arXiv — Machine Learning research 14d ago Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,… 18 arXiv — Machine Learning research 14d ago PACT: Preserving Anchored Cores in Task-vectors for Model Merging arXiv:2606.18627v1 Announce Type: new Abstract: Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. 
Most existing model merging approaches follow the Task… 30 arXiv — NLP / Computation & Language research 14d ago PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning arXiv:2606.18473v1 Announce Type: new Abstract: Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often… 15 r/LocalLLaMA community 14d ago Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5 Before Fable 5 was shut down, it helped us optimize our Gemma 4 WebGPU kernels, reaching around 255 tokens per second on my M4 Max. Today, we're releasing the demo and kernels for you to try out yourself. Hope you find it interesting! Links: - Demo (+ kernels):… 9 arXiv — Machine Learning research 15d ago Performance-Driven Environment Abstraction with Multi-Timescale Learning arXiv:2606.17377v1 Announce Type: new Abstract: We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We… 8 Hugging Face Daily Papers research 16d ago Memento: Reconstruct to Remember for Consistent Long Video Generation Abstract Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-form video generation requires… 17 Hugging Face Daily Papers research 16d ago Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving Abstract Multi-turn large language model serving faces memory constraints due to growing key-value cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management. Generated… 14 arXiv — Machine Learning research 16d ago Repeated Bilateral Trade: The Quest for Fairness arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the… 34 arXiv — Machine Learning research 16d ago InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset arXiv:2606.15730v1 Announce Type: new Abstract: Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common… 35 arXiv — NLP / Computation & Language research 16d ago Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation arXiv:2606.15266v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. 
However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains… 16 arXiv — NLP / Computation & Language research 16d ago Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning arXiv:2606.15333v1 Announce Type: new Abstract: LLM unlearning has emerged as a cost-effective alternative to full retraining for removing hazardous knowledge from pretrained models while preserving general utility. Recent RL-based methods such as RULE reformulate unlearning as… 5 arXiv — NLP / Computation & Language research 16d ago Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations arXiv:2606.15335v1 Announce Type: new Abstract: When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and… 17 arXiv — NLP / Computation & Language research 16d ago Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are… 21 Hugging Face Daily Papers research 16d ago Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning Abstract Nemotron 3 Ultra is a large-scale language model featuring hybrid Mamba-Attention architecture with 550 billion parameters, achieving high inference throughput and extended context length through specialized training techniques. 
Generated by… 5 Hugging Face Daily Papers research 16d ago VisualClaw: A Real-Time, Personalized Agent for the Physical World Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as… 32