Tag: Gpu
500 articles archived under #gpu

r/LocalLLaMA · community · 1h ago
[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode
I came across this interesting article https://blog.exolabs.net/nvidia-dgx-spark/. I don't have the DGX Spark, but it made me curious: will this kind of arch speed up my setup for LLMs? Mac can host large models but the prefill speed sucks, so I tested it on my setup for…
25

r/LocalLLaMA · community · 2h ago
They fit! Mostly.... 2x 3090, Thermaltake Core p3
Got another 3090, had to print a bracket to angle the radiator and make room for the GPUs 💀 ended up liking the look more than I thought... qwen 27b go brrrrr
submitted by /u/anthonyg45157
6

arXiv — Machine Learning · research · 2h ago
Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol
arXiv:2607.00089v1 Announce Type: new Abstract: Mechanistic interpretability has produced a rich inventory of component-level analyses that characterise what neural-network components encode and how they interact. Their outputs, however, are not easily reusable: selectivity…
8

arXiv — Machine Learning · research · 2h ago
SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling
arXiv:2607.00095v1 Announce Type: new Abstract: Generative models have emerged as scalable surrogates for physical simulation, yet they offer no guarantee that their outputs respect the conservation laws, boundary conditions, and nonlinear invariants that govern the underlying…
15

arXiv — NLP / Computation & Language · research · 2h ago
Watermarking for Proprietary Dataset Protection
arXiv:2607.00325v1 Announce Type: cross Abstract: A growing body of literature suggests that training data membership inference problems are fundamentally hard tasks in modern language modeling settings.
We argue that output watermarking techniques are the right gadget to make…
8

arXiv — Machine Learning · research · 2h ago
Prototype Language Models
arXiv:2607.00510v1 Announce Type: new Abstract: Knowing which training examples drive outputs is fundamental to auditing, correcting, and understanding language models, yet for modern LLMs this remains expensive, approximate, and largely post-hoc. Standard language models…
22

arXiv — Machine Learning · research · 2h ago
MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression
arXiv:2607.00760v1 Announce Type: new Abstract: Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cache grows linearly with context length, it can exhaust GPU…
9

arXiv — NLP / Computation & Language · research · 2h ago
Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents
arXiv:2607.00895v1 Announce Type: new Abstract: Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool…
14

r/LocalLLaMA · community · 10h ago
Anyone using TensTorrent gpus for your local ai? What's been your experience?
I'm always keeping an eye on competitive hardware and was looking at tenstorrent cards, particularly the p150a which while its memory bandwidth is only 512GB/s, it does have 32 GB of GDDR6 and a high-speed Ethernet fabric (4×800 GbE) so multi-card systems don't rely on PCIe…
36

r/MachineLearning · community · 11h ago
Spot/interruptible H100 and A100 pricing across RunPod, Vast.ai, and AWS - June 2026 data [D]
Following up on the on-demand comparison from a couple weeks back - pulled spot/interruptible pricing this time since that's where the real savings conversation actually lives for anyone running checkpointed training or batch jobs. Checked: June 2026. Spot/interruptible tier,…
35

r/LocalLLaMA · community · 12h ago
Llama-b9856 Win Cuda 12.4 - Windows Defender claims it's a trojan
Hi, just downloaded this release earlier today. Attempted to run llama-server, and Windows Defender shut it down. It says it's Wacatac.H!ml. It removed the llama-server-impl.dll file from the folder. Older releases work fine
submitted by /u/Far_Course2496…
10

r/LocalLLaMA · community · 13h ago
July 2026; where are Intel's GPU speeds today at?
Hey all, it is 1st of July, 2H26, and I hope that Intel has been catching up on their firmware support for their B50-B70 cards in recent months. In some places of the world, they do sound like a good VRAM/money offer, and hence I would love for you to share your recent PP / TG…
19

r/LocalLLaMA · community · 14h ago
Open Models - June 2026
After overwhelming April, OK May, here's June. Yeah, Graph has only less items. Because we got other items here last month.
Finetunes: Nex-N2, Ornith-1.0, Agents-A1, Holo3.1, Tmax-27b, MusaCoder-27B, VibeThinker-3B
NVFP4 from NVIDIA for below models:…
8

Hugging Face Daily Papers · research · 18h ago
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
Abstract: Flexible Spoken Language Model (FlexiSLM) introduces dynamic frame rate capabilities for speech input and output, achieving superior performance over fixed-frame-rate models while enabling controllable inference speed. Generated by Qwen/Qwen2.5-Coder-32B-Instruct
Spoken…
15

llama.cpp releases · dev-tools · 18h ago
b9856
CUDA: consistent use of restrict + PDL for FA (#25185)
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework
Linux: Ubuntu x64 (CPU), Ubuntu arm64 (CPU), Ubuntu s390x (CPU), Ubuntu x64 (Vulkan), Ubuntu arm64…
32

r/LocalLLaMA · community · 19h ago
Thinking about grabbing 4x Ascend GX10s
Some in this sub have tested GLM5.2 on 4x DGX Sparks (or Ascend GX10) with 400-500 tok/s prompt processing and ~15 tok/s output at 128k context. Not blazing fast, but usable imo, especially with quantization. My thinking: If there's an open-source fable 5 sometime in december or…
20

Hugging Face Daily Papers · research · 23h ago
Little Brains, Big Feats: Exploring Compact Language Models
Abstract: Small language models can effectively perform retrieval-augmented generation tasks directly on-device without GPU acceleration. Generated by Qwen/Qwen2.5-Coder-32B-Instruct
While large language models have been dominating the research landscape recently, small language…
13

arXiv — NLP / Computation & Language · research · 1d ago
Revocable Learned State via Process Sidecars
arXiv:2606.30788v1 Announce Type: cross Abstract: Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities.
Revoking the memory after the safety phase is not…
17

arXiv — Machine Learning · research · 1d ago
Fora: From Weight-Space to Function-Space Protection in Capability-Preserving Fine-Tuning
arXiv:2606.31092v1 Announce Type: new Abstract: Full fine-tuning adapts large language models to new tasks but can erode capabilities they already possess. Existing remedies protect through proxies such as parameter distances, importance penalties, output matching, or dominant…
11

arXiv — Machine Learning · research · 1d ago
TDGT: A Tabular Data Generation Toolkit supporting adaptive GPU-accelerated Bayesian mixture models, diffusion-based models, and latent-space generative modeling
arXiv:2606.31268v1 Announce Type: new Abstract: The growing demand for privacy-preserving data sharing has positioned synthetic data generation as a critical component of responsible AI workflows. Despite notable advances in generative modeling, existing solutions often lack…
29

arXiv — Machine Learning · research · 1d ago
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
arXiv:2606.32008v1 Announce Type: new Abstract: Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do…
28

arXiv — NLP / Computation & Language · research · 1d ago
CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations
arXiv:2606.31033v1 Announce Type: new Abstract: In this paper, we propose CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation (RAG).
In long-form RAG outputs, hallucinations often arise in localized spans rather than throughout an entire…
20

arXiv — NLP / Computation & Language · research · 1d ago
SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
arXiv:2606.31145v1 Announce Type: new Abstract: Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching…
11

arXiv — NLP / Computation & Language · research · 1d ago
Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law
arXiv:2606.31250v1 Announce Type: new Abstract: Large language models (LLMs) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standard:…
36

arXiv — NLP / Computation & Language · research · 1d ago
CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield
arXiv:2606.31796v1 Announce Type: new Abstract: We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output…
14

arXiv — NLP / Computation & Language · research · 1d ago
Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?
arXiv:2606.31407v1 Announce Type: cross Abstract: Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity.
Yet our analysis shows…
15

arXiv — NLP / Computation & Language · research · 1d ago
Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models
arXiv:2606.31511v1 Announce Type: cross Abstract: In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a…
9

arXiv — NLP / Computation & Language · research · 1d ago
Learning by Surprise: Adaptive Mitigation of Model Collapse in Large Language Models
arXiv:2410.12341v4 Announce Type: replace Abstract: As AI-generated content increasingly populates the web, generative AI models are at growing risk of being trained on their own outputs, a process known as AI autophagy. This feedback loop has been shown to induce model…
16

r/LocalLLaMA · community · 1d ago
DeepSeek-V4-Flash (MXFP4): compute buffer scales ~3x just from KV cache quant type (f16 vs q8_0) — anyone else seeing this?
Llama.cpp Bartowski's DeepSeek-V4-Flash-MXFP4 GGUF, llama.cpp build 9851 (0eca4d490), deepseek4 arch. Ran the same n_ctx = 10240, same n_ubatch = n_batch = 8192, flash attention on — only difference is -ctk / -ctv:
Cache type | Total KV cache (CUDA0) | CUDA0 compute buffer
f16 (default,…
18

Vercel — AI · dev-tools · 1d ago
Dry-run deployments with Vercel CLI
You can now preview the framework preset and files that Vercel CLI includes in a deployment before creating one. Run vercel deploy --dry from a linked project: For automation or further inspection, return the complete file manifest as JSON: JSON output includes the detected…
17

r/LocalLLaMA · community · 1d ago
Is there an alternative to C-Payne for 100-lane PCIe 5.0 switches? Needed for 8-GPU build.
Sadly Christian is on vacation or something, which is a shame because the C-Payne PCIe gear is the best around.
In the meantime I need this to add some urgent compute capacity:…
15

llama.cpp releases · dev-tools · 1d ago
b9851
cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel (#24945)
Co-authored-by: Stanisław Szymczyk sszymczy@gmail.com
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED…
10

TechCrunch — AI · news-outlet · 1d ago
Nvidia competitor Etched hits $5B valuation, $1B in sales for AI chip
Nvidia AI chip competitor Etched says it has already booked $1 billion under contract for the inference systems powered by its chip.
19

NVIDIA Developer Blog · official-blog · 1d ago
Designing GPU-Accelerated Query Engines with NVIDIA GQE
GPU-accelerated query engines are often constrained by memory and I/O bandwidth. NVIDIA hardware advances—including high bandwidth memory (HBM), NVIDIA...
36

r/LocalLLaMA · community · 1d ago
HIP: use hipBLAS for dense prefill on gfx900, keep MMQ for MoE by DEV-DUFORD · Pull Request #24588 · ggml-org/llama.cpp
Overall Performance Gains: Qwen3.5 4B: +36.1%; Qwen3.6 27B: +18.9%; Gemma4 12B: +65.1%; Overall average: ~40%
Only for gfx900 related GPUs: Vega GPU, codename vega10, including Radeon Vega Frontier Edition, Radeon RX Vega 56/64, Radeon RX Vega 64 Liquid, Radeon Pro Vega…
5

NVIDIA Developer Blog · official-blog · 1d ago
Optimizing a Neural Reconstruction Pipeline Using NVIDIA Nsight Developer Tools
NVIDIA Omniverse NuRec is a neural reconstruction pipeline for building high-fidelity 3D representations of real-world environments from multisensor data such...
8

llama.cpp releases · dev-tools · 1d ago
b9848
CUDA: fix get_rows_back for tables with more than 65535 rows (grid-y clamp + stride) (#25103)
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework
Linux: Ubuntu x64 (CPU), Ubuntu arm64 (CPU), Ubuntu s390x…
9

llama.cpp releases · dev-tools · 1d ago
b9847
CUDA: fix Gemma E4B MTP FlashAttention (#25148)
CUDA: fix Gemma E4B MTP FlashAttention; remove unused template declaration
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework
Linux: Ubuntu x64 (CPU)…
16

r/LocalLLaMA · community · 1d ago
nvidia/Qwen3.6-27B-NVFP4 just dropped
https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4
submitted by /u/vanbukin
37

llama.cpp releases · dev-tools · 1d ago
b9844
ggml-webgpu: add support for NVFP4 (#25143)
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework
Linux: Ubuntu x64 (CPU), Ubuntu arm64 (CPU), Ubuntu s390x (CPU), Ubuntu x64 (Vulkan), Ubuntu arm64 (Vulkan)…
19

arXiv — Machine Learning · research · 2d ago
What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs
arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these…
31

arXiv — Machine Learning · research · 2d ago
The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables
arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps.
From the tensor we derive the Coupling Amplification…
38

arXiv — Machine Learning · research · 2d ago
When Can Conformal Risk Control Certify LLM Outputs? Bounds, Impossibility, and Adaptation for Structured Generation
arXiv:2606.29054v1 Announce Type: new Abstract: Large language models (LLMs) deployed for structured generation (NER, JSON extraction, QA, and classification) lack formal reliability guarantees, and standard heuristic abstention policies miss user-specified risk targets by…
4

arXiv — Machine Learning · research · 2d ago
Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps
arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating…
32

arXiv — NLP / Computation & Language · research · 2d ago
Beyond the Mean: Three-Axis Fidelity for Aligning LLM-Based Survey Simulators from Small Pilot Data
arXiv:2606.28963v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to simulate social survey responses, yet their outputs exhibit systematic biases: marginal distributions are skewed, response variance is poorly calibrated, and predictor-outcome…
20

arXiv — NLP / Computation & Language · research · 2d ago
Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks
arXiv:2606.29082v1 Announce Type: new Abstract: Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture?
Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on…
4

arXiv — NLP / Computation & Language · research · 2d ago
AURORA: Asymmetry and Update-Induced Rotation for Robust Hallucination Detection in Large Language Models
arXiv:2606.29545v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to generate hallucinations, namely factually incorrect or unfaithful outputs,…
27

arXiv — NLP / Computation & Language · research · 2d ago
Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression
arXiv:2606.29712v1 Announce Type: new Abstract: Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting…
22

arXiv — NLP / Computation & Language · research · 2d ago
How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation
arXiv:2606.29809v1 Announce Type: new Abstract: Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating…
27

r/LocalLLaMA · community · 2d ago
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…
19