Tag

Gpu

500 articles archived under #gpu · RSS

llama.cpp releases dev-tools 5d ago

b9810

CUDA: add cublasSgemmBatched mapping for HIP/MUSA vendor headers ( #25033 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64…

31
arXiv — Machine Learning research 6d ago

\chisao{}: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation

arXiv:2606.26164v1 Announce Type: new Abstract: Finding all modes of a multimodal black-box function is a fundamental challenge in optimization, Bayesian inference, and scientific computing. Existing approaches -- basin-hopping, CMA-ES, multistart gradient descent -- operate…

26
arXiv — Machine Learning research 6d ago

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

arXiv:2606.26453v1 Announce Type: new Abstract: We present KernelPro, a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and…

21
arXiv — Machine Learning research 6d ago

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels…

20
arXiv — NLP / Computation & Language research 6d ago

Structure Before Collapse: Transient semantic geometry in next-token prediction

arXiv:2606.26749v1 Announce Type: cross Abstract: Neural Collapse predicts that balanced one-hot classification pushes model representations to be equally far from each other; a symmetric configuration that depends only on the output label and ignores any semantic similarity in…

29
arXiv — NLP / Computation & Language research 6d ago

Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

arXiv:2606.27023v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed…

15
arXiv — Machine Learning research 6d ago

Finding Stationary Points by Comparisons

arXiv:2606.27082v1 Announce Type: new Abstract: We study the problem of finding stationary points of non-convex functions when access to the objective is provided only through a comparison oracle that, given two points, outputs which has the larger function value. For a twice…

17
arXiv — NLP / Computation & Language research 6d ago

ProvenAI: Provenance-Native Traces of Evidence in Generated Answers

arXiv:2606.26449v1 Announce Type: new Abstract: Retrieval-augmented systems routinely present citations alongside generated answers, yet a citation does not confirm that the corresponding source meaningfully shaped the output. This paper introduces ProvenAI, a framework that…

17
arXiv — NLP / Computation & Language research 6d ago

\textsc{DiARC}: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models

arXiv:2606.26530v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC;~\citealp{chollet2019measure}) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches…

22
arXiv — NLP / Computation & Language research 6d ago

GAVEL: Grounded Caption Error Verification and Localization

arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy…

24
arXiv — NLP / Computation & Language research 6d ago

KARLA: Knowledge-base Augmented Retrieval for Language Models

arXiv:2606.26807v1 Announce Type: cross Abstract: We propose a new method that allows an LLM to automatically pull in factual knowledge from a knowledge base during token generation. This means that (1)~factual knowledge in the LLM output can be updated without retraining the…

12
arXiv — NLP / Computation & Language research 6d ago

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

arXiv:2606.27226v1 Announce Type: cross Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores…

14
r/LocalLLaMA community 6d ago

For dual GPUs, will there be any big impact to inference speeds when running in PCIe 5.0 x8/x4 vs x8/x8?

I bought the Biostar Z890 Valkyrie because it was on sale and had three PCIe 5.0 slots connected to the CPU (x16 or x8/x8 or x8/x4/x4), which I thought would be great for running dual GPUs for LLM inference. The problem is that now I want to add a SATA expansion card to the…

25
r/LocalLLaMA community 6d ago

When you don't have a data center GPU

Please don't tell me someone is going to (yet again) reply with the longest finetune-merge name in eternity...   submitted by   /u/Iwaku_Real [link]   [comments]

4
Latent.Space news-outlet 6d ago

[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.

It's happening.

37
r/LocalLLaMA community 6d ago

Built an open source local first Kanban workflow for running AI coding agents without babysitting every step

I’ve been building BatonBot, a local first app for running AI coding workflows with less babysitting. The problem I kept running into, especially with local models, is that coding agents can be useful but the workflow gets slow: start task → wait → check output → fix next issue…

10
r/LocalLLaMA community 6d ago

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

I’ve been working on audio.cpp , a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything…

24
NVIDIA Developer Blog official-blog 6d ago

Streamlining Resource Binding with End-to-End Support for Vulkan Descriptor Heaps

Shaders are GPU programs that process visual data—such as rays, pixels, geometry, and textures—to produce specific rendering effects. Shaders find necessary...

32
r/MachineLearning community 6d ago

Kuma: compiling PyTorch models into self-contained WebGPU executables [P]

I've been experimenting with a compiler/runtime project that I'm not entirely sure is a good idea, so I'd love some feedback from people who've worked on deployment systems. The idea is to compile an exported PyTorch model into a self-contained package that contains: graph…

33
r/LocalLLaMA community 6d ago

DGX Spark OS lifetime?

I think of purchasing 2 DGX Sparks for my office (because a 700+W workstation would be intolerable) for LLM-centric work (inference only, no fine-tuning). I know the OS is based on Ubuntu 24.04. Has Nvidia ever disclosed what is the lifetime of the OS? Meaning, is there a chance…

17
r/LocalLLaMA community 6d ago

LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels

Everything runs locally in your browser using custom WebGPU kernels written by Fable 5 (before it was shut down) and Opus 4.8. The video was recorded on my M4 Max. Model: LiquidAI/LFM2.5-230M ( GGUF ) Demo: https://huggingface.co/spaces/webml-community/lfm2-webgpu-kernels  …

37
Hugging Face Daily Papers research 6d ago

Forecasting Future Behavior as a Learning Task

Abstract Behavior Forecasters are trained to predict large reasoning model outputs from single trajectories, outperforming large language models while requiring significantly less computational cost. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Trust in an AI system is often…

24
NVIDIA Developer Blog official-blog 6d ago

Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support

Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the...

38
NVIDIA Developer Blog official-blog 6d ago

How KRAFTON Built PUBG Ally, a Co-Playable Character Powered by NVIDIA ACE

AI companions in games have long been constrained by scripted behavior trees and fixed dialogue. PUBG Ally is a different kind of system. Built by KRAFTON for...

26
r/LocalLLaMA community 6d ago

Tensor Split Fix for intel GPU's llama.cpp release b9788

sycl : support --split-mode tensor #24152 I'd like to see some numbers if anyone has 2xintel gpus and tries this out   submitted by   /u/Bulky-Priority6824 [link]   [comments]

10
r/LocalLLaMA community 6d ago

siq1 on kebab bench

tested my model on kebab bench and it performs very well: https://huggingface.co/spaces/AlexWortega/hermes-agent-zerogpu   submitted by   /u/Mysterious_Hearing14 [link]   [comments]

29
Hugging Face Daily Papers research 6d ago

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Abstract Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Jailbreak attacks reveal…

23
Hugging Face Daily Papers research 7d ago

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

Abstract Tool Suppression occurs when JSON Schema constraints and tool calling are jointly enabled, preventing open-weight models from invoking tools despite maintaining schema compliance, with the issue stemming from grammar-based token masking that makes tool-call tokens…

5
llama.cpp releases dev-tools 7d ago

b9788

sycl : support --split-mode tensor ( #24152 ) Sycl tp stage1 ( #1 ) SYCL: tensor parallelism (--split-mode tensor) for dual-GPU Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce,…

33
r/LocalLLaMA community 7d ago

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone.

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone. Instead of generating strictly one token at a time, it uses a frozen autoregressive context tower plus a diffusion denoiser tower…

38
r/LocalLLaMA community 7d ago

Worse quality with MTP - Qwen 3.6, Gemma 4

Hi. I am self-hosting Qwen 3.6 27B Q8_K_XL with Llama.cpp on 4x5070ti. (All 4 cards are on single x16 slot bifurcated to 4x4 with risers). I've been testing it on several work repos with Opencode CLI and in like 8/10 situations the output of non-MTP model is far superior to the…

8
Hugging Face Daily Papers research 7d ago

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

Abstract Two-channel evaluation shows output compression reduces costs while input compression increases costs and degrades accuracy across models and datasets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct "Talk short. Drop grammar. Save token." This caveman style is widely…

28
r/LocalLLaMA community 7d ago

If LLMs are so good at coding…

How come things like ROCm and the intel stack aren’t able to rapidly improve their software ecosystems to be a match for CUDA? Until the software from other vendors catches up with NVIDIA, they’re always going to get away with charging a massive premium on their “it just works”…

38
arXiv — Machine Learning research 7d ago

Quantifying Explainable AI-introduced signal noise on ECG data with Spectral Entropy

arXiv:2606.24974v1 Announce Type: new Abstract: Explainability techniques are used to assess the output of various deep learning models. This is especially true in healthcare, where models need to be trusted and decisions justified. Explainability (XAI) tools use heuristics…

22
arXiv — Machine Learning research 7d ago

Erased, but Not Gone: Output Forgetting Is Not True Forgetting

arXiv:2606.25001v1 Announce Type: new Abstract: Machine unlearning (MU) is commonly judged by output forgetting, such as low forget-set accuracy or reduced logit-level membership inference. But if output-level success can coexist with retraining-inconsistent residuals in…

26
arXiv — Machine Learning research 7d ago

Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion

arXiv:2606.25097v1 Announce Type: new Abstract: Speculative decoding accelerates inference by letting a draft model propose tokens for a target model to verify, raising a concrete safety question: at temperature zero, can draft-side behavior leak into safety-scored outputs? We…

7
arXiv — Machine Learning research 7d ago

Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

arXiv:2606.25432v1 Announce Type: new Abstract: Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it…

28
arXiv — NLP / Computation & Language research 7d ago

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

arXiv:2606.25182v1 Announce Type: new Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it…

5
arXiv — NLP / Computation & Language research 7d ago

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

arXiv:2606.25605v1 Announce Type: new Abstract: Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed…

10
arXiv — NLP / Computation & Language research 7d ago

Weave of Formal Thought

arXiv:2606.25987v1 Announce Type: new Abstract: Large language models (LLMs) attain remarkable surface fluency on code, yet they neither formally guarantee the syntactic validity of their output nor leverage the hierarchical structure defining the target language. While existing…

18
arXiv — NLP / Computation & Language research 7d ago

Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

arXiv:2606.25721v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate model outputs through malicious retrieved documents. Existing detection methods typically rely on auxiliary classifiers or…

30
arXiv — NLP / Computation & Language research 7d ago

RAS: Measuring LLM Safety Through Refusal Alignment

arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is…

27
arXiv — NLP / Computation & Language research 7d ago

Scale or Reason? A Compute-Equivalent Analysis of Reasoning Distillation

arXiv:2509.22193v2 Announce Type: replace Abstract: Distilling reasoning traces from strong teacher models has become the standard recipe for building capable small language models. Yet reasoning traces are 5-20$\times$ longer than standard instruction fine-tuning (IFT) outputs,…

19
r/LocalLLaMA community 7d ago

Any chance I could cluster my DGX Spark (128GB unified memory) and my AMD Ryzen AI Max 395 (128GM unified memory) together to run 1 model?

Hey all, So I have a Nvidia DGX Spark and an AMD Strix 395, both have 128GB of unified memory. The Spark has 200Gbit network and the AMD Strix has 5Gbit ethernet (but it has a pcie gen 4x4 slot). Is there any chance I can cluster the 2 together to run a larger model that can fit…

30
r/LocalLLaMA community 7d ago

SDXL running locally in the browser on WebGPU, open-source

I needed simple local image generation without the usual setup. No virtual environments, no ComfyUI with a complex graph and installation as an exe. So i tried to push the whole thing into the browser and run it on WebGPU. It's a browser extension. You install it, then it loads…

13
r/MachineLearning community 7d ago

MuJoCo derived Simulator for High Fidelity Vision RL training natively on GPU [D]

Hi everyone, For the past couple of weeks I have been working on a simulator project considering the shortcomings of MuJoCo. There are things that people like and also don't like about MuJoCo, like the CPU dependency on MuJoCo which makes the simulation not parallelizable beyond…

31
r/LocalLLaMA community 7d ago

MINISFORUM DEG1 Oculink eGPU Dock Refurbished - $59

I got one of these refurbished units last year. I have nothing but good things to say about it. It works great. It has heft to it to keep the GPU secure. And unlike some cheaper Oculink docks, it has redrivers for signal integrity.   submitted by   /u/fallingdowndizzyvr…

20
r/LocalLLaMA community 7d ago

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

I'm wondering if anyone else has come across this. I've tested the same model on llama.cpp and vLLM with similar settings and quantizations. The performance and concurrency in vLLM are much noticeably better, but sometimes the model feels less reliable. Some things I've noticed:…

27
NVIDIA Developer Blog official-blog 7d ago

Accelerating BEV Pooling on NVIDIA GPUs for Physical AI Applications

An increasingly common design pattern for autonomous vehicles (AVs), robotics, and spatial AI systems is bird's-eye-view (BEV) perception. BEV models project...

31
Hugging Face official-blog 7d ago

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

Back to Articles a]:hidden"> Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel Enterprise + Article Published June 24, 2026 Upvote - Adil Asif adil-asif nvidia Alexandros Koumparoulis akoumpa nvidia Wenwen Gao wgao2021 nvidia Sylendran Arunagiri Sylendran95 nvidia…

29

b9810

\chisao{}: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Structure Before Collapse: Transient semantic geometry in next-token prediction

Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

Finding Stationary Points by Comparisons

ProvenAI: Provenance-Native Traces of Evidence in Generated Answers

\textsc{DiARC}: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models

GAVEL: Grounded Caption Error Verification and Localization

KARLA: Knowledge-base Augmented Retrieval for Language Models

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

For dual GPUs, will there be any big impact to inference speeds when running in PCIe 5.0 x8/x4 vs x8/x8?

When you don't have a data center GPU

[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.

Built an open source local first Kanban workflow for running AI coding agents without babysitting every step

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

Streamlining Resource Binding with End-to-End Support for Vulkan Descriptor Heaps

Kuma: compiling PyTorch models into self-contained WebGPU executables [P]

DGX Spark OS lifetime?

LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels

Forecasting Future Behavior as a Learning Task

Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support

How KRAFTON Built PUBG Ally, a Co-Playable Character Powered by NVIDIA ACE

Tensor Split Fix for intel GPU's llama.cpp release b9788

siq1 on kebab bench

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

b9788

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone.

Worse quality with MTP - Qwen 3.6, Gemma 4

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

If LLMs are so good at coding…

Quantifying Explainable AI-introduced signal noise on ECG data with Spectral Entropy

Erased, but Not Gone: Output Forgetting Is Not True Forgetting

Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion

Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

Weave of Formal Thought

Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

RAS: Measuring LLM Safety Through Refusal Alignment

Scale or Reason? A Compute-Equivalent Analysis of Reasoning Distillation

Any chance I could cluster my DGX Spark (128GB unified memory) and my AMD Ryzen AI Max 395 (128GM unified memory) together to run 1 model?

SDXL running locally in the browser on WebGPU, open-source

MuJoCo derived Simulator for High Fidelity Vision RL training natively on GPU [D]

MINISFORUM DEG1 Oculink eGPU Dock Refurbished - $59

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

Accelerating BEV Pooling on NVIDIA GPUs for Physical AI Applications

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel