Tag

Gpu

500 articles archived under #gpu

r/LocalLLaMA community 1h ago

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

I came across this interesting article: https://blog.exolabs.net/nvidia-dgx-spark/ I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my setup for LLMs? A Mac can host large models, but the prefill speed sucks, so I tested it on my setup for…

25
r/LocalLLaMA community 2h ago

They fit! Mostly.... 2x 3090, Thermaltake Core p3

Got another 3090; had to print a bracket to angle the radiator and make room for the GPUs 💀 Ended up liking the look more than I thought... qwen 27b go brrrrr. Submitted by /u/anthonyg45157.

6
arXiv — Machine Learning research 2h ago

Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol

arXiv:2607.00089v1 Announce Type: new Abstract: Mechanistic interpretability has produced a rich inventory of component-level analyses that characterise what neural-network components encode and how they interact. Their outputs, however, are not easily reusable: selectivity…

8
arXiv — Machine Learning research 2h ago

SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling

arXiv:2607.00095v1 Announce Type: new Abstract: Generative models have emerged as scalable surrogates for physical simulation, yet they offer no guarantee that their outputs respect the conservation laws, boundary conditions, and nonlinear invariants that govern the underlying…

15
arXiv — NLP / Computation & Language research 2h ago

Watermarking for Proprietary Dataset Protection

arXiv:2607.00325v1 Announce Type: cross Abstract: A growing body of literature suggests that training data membership inference problems are fundamentally hard tasks in modern language modeling settings. We argue that output watermarking techniques are the right gadget to make…

8
arXiv — Machine Learning research 2h ago

Prototype Language Models

arXiv:2607.00510v1 Announce Type: new Abstract: Knowing which training examples drive outputs is fundamental to auditing, correcting, and understanding language models, yet for modern LLMs this remains expensive, approximate, and largely post-hoc. Standard language models…

22
arXiv — Machine Learning research 2h ago

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

arXiv:2607.00760v1 Announce Type: new Abstract: Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cache grows linearly with context length, it can exhaust GPU…

9
arXiv — NLP / Computation & Language research 2h ago

Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

arXiv:2607.00895v1 Announce Type: new Abstract: Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool…

14
r/LocalLLaMA community 10h ago

Anyone using TensTorrent gpus for your local ai? What's been your experience?

I'm always keeping an eye on competitive hardware and was looking at tenstorrent cards, particularly the p150a which while its memory bandwidth is only 512GB/s, it does have 32 GB of GDDR6 and a high-speed Ethernet fabric (4×800 GbE) so multi-card systems don't rely on PCIe…

36
r/MachineLearning community 11h ago

Spot/interruptible H100 and A100 pricing across RunPod, Vast.ai, and AWS - June 2026 data [D]

Following up on the on-demand comparison from a couple weeks back - pulled spot/interruptible pricing this time since that's where the real savings conversation actually lives for anyone running checkpointed training or batch jobs. Checked: June 2026. Spot/interruptible tier,…

35
r/LocalLLaMA community 12h ago

Llama-b9856 Win Cuda 12.4 - Windows Defender claims it's a trojan

Hi, just downloaded this release earlier today. Attempted to run llama-server, and Windows Defender shut it down. It says it's Wacatac.H!ml. It removed the llama-server-impl.dll file from the folder. Older releases work fine. Submitted by /u/Far_Course2496.

10
r/LocalLLaMA community 13h ago

July 2026; where are Intel's GPU speeds today at?

Hey all, it is 1st of July, 2H26, and I hope that Intel has been catching up on their firmware support for their B50-B70 cards in recent months. In some places of the world, they do sound like a good VRAM/money offer, and hence I would love for you to share your recent PP / TG…

19
r/LocalLLaMA community 14h ago

Open Models - June 2026

After overwhelming April, OK May, here's June. Yeah, Graph has only less items. Because we got other items here last month. Finetunes: Nex-N2 Ornith-1.0 Agents-A1 Holo3.1 Tmax-27b MusaCoder-27B VibeThinker-3B NVFP4 from NVIDIA for below models:…

8
Hugging Face Daily Papers research 18h ago

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Abstract: Flexible Spoken Language Model (FlexiSLM) introduces dynamic frame rate capabilities for speech input and output, achieving superior performance over fixed-frame-rate models while enabling controllable inference speed. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Spoken…

15
llama.cpp releases dev-tools 18h ago

b9856

CUDA: consistent use of restrict + PDL for FA (#25185). macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED; macOS Intel (x64); iOS XCFramework. Linux: Ubuntu x64 (CPU); Ubuntu arm64 (CPU); Ubuntu s390x (CPU); Ubuntu x64 (Vulkan); Ubuntu arm64…

32
r/LocalLLaMA community 19h ago

Thinking about grabbing 4x Ascend GX10s

Some in this sub have tested GLM5.2 on 4x DGX Sparks (or Ascend GX10) with 400-500 tok/s prompt processing and ~15 tok/s output at 128k context. Not blazing fast, but usable imo, especially with quantization. My thinking: If there's an open-source fable 5 sometime in december or…

20
Hugging Face Daily Papers research 23h ago

Little Brains, Big Feats: Exploring Compact Language Models

Abstract: Small language models can effectively perform retrieval-augmented generation tasks directly on-device without GPU acceleration. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) While large language models have been dominating the research landscape recently, small language…

13
arXiv — NLP / Computation & Language research 1d ago

Revocable Learned State via Process Sidecars

arXiv:2606.30788v1 Announce Type: cross Abstract: Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not…

17
arXiv — Machine Learning research 1d ago

Fora: From Weight-Space to Function-Space Protection in Capability-Preserving Fine-Tuning

arXiv:2606.31092v1 Announce Type: new Abstract: Full fine-tuning adapts large language models to new tasks but can erode capabilities they already possess. Existing remedies protect through proxies such as parameter distances, importance penalties, output matching, or dominant…

11
arXiv — Machine Learning research 1d ago

TDGT: A Tabular Data Generation Toolkit supporting adaptive GPU-accelerated Bayesian mixture models, diffusion-based models, and latent-space generative modeling

arXiv:2606.31268v1 Announce Type: new Abstract: The growing demand for privacy-preserving data sharing has positioned synthetic data generation as a critical component of responsible AI workflows. Despite notable advances in generative modeling, existing solutions often lack…

29
arXiv — Machine Learning research 1d ago

Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

arXiv:2606.32008v1 Announce Type: new Abstract: Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do…

28
arXiv — NLP / Computation & Language research 1d ago

CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations

arXiv:2606.31033v1 Announce Type: new Abstract: In this paper, we propose CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation (RAG). In long-form RAG outputs, hallucinations often arise in localized spans rather than throughout an entire…

20
arXiv — NLP / Computation & Language research 1d ago

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

arXiv:2606.31145v1 Announce Type: new Abstract: Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching…

11
arXiv — NLP / Computation & Language research 1d ago

Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law

arXiv:2606.31250v1 Announce Type: new Abstract: Large language models (LLMs) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standard:…

36
arXiv — NLP / Computation & Language research 1d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

arXiv:2606.31796v1 Announce Type: new Abstract: We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output…

14
arXiv — NLP / Computation & Language research 1d ago

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

arXiv:2606.31407v1 Announce Type: cross Abstract: Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows…

15
arXiv — NLP / Computation & Language research 1d ago

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

arXiv:2606.31511v1 Announce Type: cross Abstract: In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a…

9
arXiv — NLP / Computation & Language research 1d ago

Learning by Surprise: Adaptive Mitigation of Model Collapse in Large Language Models

arXiv:2410.12341v4 Announce Type: replace Abstract: As AI-generated content increasingly populates the web, generative AI models are at growing risk of being trained on their own outputs, a process known as AI autophagy. This feedback loop has been shown to induce model…

16
r/LocalLLaMA community 1d ago

DeepSeek-V4-Flash (MXFP4): compute buffer scales ~3x just from KV cache quant type (f16 vs q8_0) — anyone else seeing this? Llama.cpp

Bartowski's DeepSeek-V4-Flash-MXFP4 GGUF, llama.cpp build 9851 (0eca4d490), deepseek4 arch. Ran the same n_ctx = 10240, same n_ubatch = n_batch = 8192, flash attention on; only difference is -ctk / -ctv. Table columns: Cache type, Total KV cache (CUDA0), CUDA0 compute buffer. First row: f16 (default,…

18
Vercel — AI dev-tools 1d ago

Dry-run deployments with Vercel CLI

You can now preview the framework preset and files that Vercel CLI includes in a deployment before creating one. Run vercel deploy --dry from a linked project. For automation or further inspection, return the complete file manifest as JSON. JSON output includes the detected…

17
r/LocalLLaMA community 1d ago

Is there an alternative to C-Payne for 100-lane PCIe 5.0 switches? Needed for 8-GPU build.

Sadly Christian is on vacation or something, which is a shame because the C-Payne PCIe gear is the best around. In the meantime I need this to add some urgent compute capacity:…

15
llama.cpp releases dev-tools 1d ago

b9851

cuda: prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel (#24945). Co-authored-by: Stanisław Szymczyk sszymczy@gmail.com. macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED…

10
TechCrunch — AI news-outlet 1d ago

Nvidia competitor Etched hits $5B valuation, $1B in sales for AI chip

Nvidia AI chip competitor Etched says it has already booked $1 billion under contract for the inference systems powered by its chip.

19
NVIDIA Developer Blog official-blog 1d ago

Designing GPU-Accelerated Query Engines with NVIDIA GQE

GPU-accelerated query engines are often constrained by memory and I/O bandwidth. NVIDIA hardware advances—including high bandwidth memory (HBM), NVIDIA...

36
r/LocalLLaMA community 1d ago

HIP: use hipBLAS for dense prefill on gfx900, keep MMQ for MoE by DEV-DUFORD · Pull Request #24588 · ggml-org/llama.cpp

Overall Performance Gains: Qwen3.5 4B: +36.1%; Qwen3.6 27B: +18.9%; Gemma4 12B: +65.1%; overall average: ~40%. Only for gfx900 related GPUs: Vega GPU, codename vega10, including Radeon Vega Frontier Edition, Radeon RX Vega 56/64, Radeon RX Vega 64 Liquid, Radeon Pro Vega…

5
NVIDIA Developer Blog official-blog 1d ago

Optimizing a Neural Reconstruction Pipeline Using NVIDIA Nsight Developer Tools

NVIDIA Omniverse NuRec is a neural reconstruction pipeline for building high-fidelity 3D representations of real-world environments from multisensor data such...

8
llama.cpp releases dev-tools 1d ago

b9848

CUDA: fix get_rows_back for tables with more than 65535 rows (grid-y clamp + stride) (#25103). macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED; macOS Intel (x64); iOS XCFramework. Linux: Ubuntu x64 (CPU); Ubuntu arm64 (CPU); Ubuntu s390x…

9
llama.cpp releases dev-tools 1d ago

b9847

CUDA: fix Gemma E4B MTP FlashAttention (#25148); remove unused template declaration. macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED; macOS Intel (x64); iOS XCFramework. Linux: Ubuntu x64 (CPU)…

16
r/LocalLLaMA community 1d ago

nvidia/Qwen3.6-27B-NVFP4 just dropped

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4 Submitted by /u/vanbukin.

37
llama.cpp releases dev-tools 1d ago

b9844

ggml-webgpu: add support for NVFP4 (#25143). macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED; macOS Intel (x64); iOS XCFramework. Linux: Ubuntu x64 (CPU); Ubuntu arm64 (CPU); Ubuntu s390x (CPU); Ubuntu x64 (Vulkan); Ubuntu arm64 (Vulkan)…

19
arXiv — Machine Learning research 2d ago

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these…

31
arXiv — Machine Learning research 2d ago

The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables

arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification…

38
arXiv — Machine Learning research 2d ago

When Can Conformal Risk Control Certify LLM Outputs? Bounds, Impossibility, and Adaptation for Structured Generation

arXiv:2606.29054v1 Announce Type: new Abstract: Large language models (LLMs) deployed for structured generation (NER, JSON extraction, QA, and classification) lack formal reliability guarantees, and standard heuristic abstention policies miss user-specified risk targets by…

4
arXiv — Machine Learning research 2d ago

Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps

arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating…

32
arXiv — NLP / Computation & Language research 2d ago

Beyond the Mean: Three-Axis Fidelity for Aligning LLM-Based Survey Simulators from Small Pilot Data

arXiv:2606.28963v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to simulate social survey responses, yet their outputs exhibit systematic biases: marginal distributions are skewed, response variance is poorly calibrated, and predictor-outcome…

20
arXiv — NLP / Computation & Language research 2d ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

arXiv:2606.29082v1 Announce Type: new Abstract: Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on…

4
arXiv — NLP / Computation & Language research 2d ago

AURORA: Asymmetry and Update-Induced Rotation for Robust Hallucination Detection in Large Language Models

arXiv:2606.29545v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to generate hallucinations, namely factually incorrect or unfaithful outputs,…

27
arXiv — NLP / Computation & Language research 2d ago

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

arXiv:2606.29712v1 Announce Type: new Abstract: Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting…

22
arXiv — NLP / Computation & Language research 2d ago

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

arXiv:2606.29809v1 Announce Type: new Abstract: Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating…

27
r/LocalLLaMA community 2d ago

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…

19
