News / #gpu Tag Gpu 500 articles archived under #gpu · RSS Sign in to follow arXiv — Machine Learning research 29d ago MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency arXiv:2606.03014v1 Announce Type: new Abstract: Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Efficiently executing this workload on limited GPU resources has bottlenecks. Skill-based… 22 arXiv — Machine Learning research 29d ago Will Accurate Fields Mislead Photonic Design? FromGlobal Accuracy to Port Readout arXiv:2606.03038v1 Announce Type: new Abstract: Neural field surrogates can accelerate photonic design loops, but a surrogate that looks accurate in global field error can still mis-rank candidate devices when the final decision depends on localized output-port readouts. This… 23 arXiv — NLP / Computation & Language research 29d ago Greener Than Humans? Environmental Attitudes in Large Language Models arXiv:2606.02741v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs.… 19 arXiv — NLP / Computation & Language research 29d ago Chatbots Output Meaningful (but Problematic) Language arXiv:2606.02973v1 Announce Type: new Abstract: Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning… 28 arXiv — NLP / Computation & Language research 29d ago Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics arXiv:2606.02981v1 Announce Type: new Abstract: Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance… 30 arXiv — NLP / Computation & Language research 29d ago The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs arXiv:2606.03357v1 Announce Type: new Abstract: When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that… 18 arXiv — NLP / Computation & Language research 29d ago Consistency Training Can Entrench Misalignment arXiv:2606.03810v1 Announce Type: new Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly… 31 Hugging Face Daily Papers research 29d ago NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation Abstract OmniDreams, a foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As… 23 r/LocalLLaMA community 29d ago Why do we benchmark quants on perplexity and prose but never on tool call validity? The mixed precision quant discussion here lately, MoE aware stuff that keeps shared experts and the edge layers at higher precision is great, but it's almost all measured against perplexity and general output quality. What I never see is structured output. Tool call JSON,… 37 Hacker News — AI on Front Page community 29d ago Use your Nvidia GPU's VRAM as swap space on Linux Article URL: https://github.com/c0dejedi/nbd-vram Comments URL: https://news.ycombinator.com/item?id=48377404 Points: 225 # Comments: 65 18 r/LocalLLaMA community 1mo ago Weird issue with OpenCode and Qwen3.6 I’m using Qwen3.6-27B running on my server with llama-server for AI coding with OpenCode. Sometimes for some reason, the response stops when its reasoning like if it has finished outputting the full response. I have to type “continue” and it continues working like if nothing… 30 llama.cpp releases dev-tools 1mo ago b9483 hexagon: profiler output fix and script updates ( #24042 ) hex-ops: fix profiler output (ie remove the redundant NONEs) hex-prof: update profiling script to support tot.usec column macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED… 26 NVIDIA Developer Blog official-blog 1mo ago Build Personal AI Agents on Windows PCs with New Tools from Microsoft and NVIDIA AI agents are changing how you interact with your PC. Creators, developers, and AI enthusiasts are already using these agents extensively to assist with... 17 r/LocalLLaMA community 1mo ago Would you consider getting an NVIDIA RTX Spark laptop? If yes, why? If no, also say why. I’d consider one if it’s faster at local AI inference than my current hardware and can still handle gaming decently. The 128GB unified memory idea is pretty interesting, but I’m unsure about Windows on Arm and game compatibility. Would you buy… 27 The Information — AI news-outlet 1mo ago Microsoft Debuts New Nvidia-Powered “Dev Box” PC; OS for Wearable and Desktop “Agent First” Devices Microsoft on Tuesday unveiled a new desktop PC powered by an Nvidia processor geared towards developers who want to run AI models locally. The new PC, called the Surface RTX Spark Dev Box, is powered by Nvidia’s new Arm-based chips announced earlier this week. Microsoft… 14 r/LocalLLaMA community 1mo ago Why don't we still have any games with AI agents used as NPC characters? Do you remember this NVIDIA AI-NPC presentation from 3 years ago? https://www.youtube.com/watch?v=5R8xZb6J3r0 Where all of that? Why do we even try getting agents to do all the work if they still cannot be reliably used as a characters in the video games? Isn't it should be the… 11 r/LocalLLaMA community 1mo ago I Put a Datacenter GPU in My Gaming PC for £200 Hey there! I wrote a blogpost about my experience running local models on a V100 from a newbie perspective and got loads of views outside of reddit, so I thought I'd share it here too!   submitted by   /u/tymscar [link]   [comments] 33 r/LocalLLaMA community 1mo ago Benchmarks of 20 small LLMs on a 6GB RTX 4050 I'm looking for models that can run on my GPU and actually do something useful. I think that any small difference could be a "big" improvement, because they are all so small. So I went to the LM studio database and searched many variants from the same family, trying to select… 37 NVIDIA Developer Blog official-blog 1mo ago Deploy Self-Evolving Agents for Faster, More Secure Research with a Hermes Agent and NVIDIA NemoClaw AI agents are a powerful tool for synthesizing data to accelerate research, summarize information, and help teams make decisions faster. But combining internal... 29 Hugging Face Daily Papers research 1mo ago Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs Abstract Watermarking AI-generated text for detection fails when multiple models are used, as averaging outputs cancels perturbations and suppresses detection while improving quality and speed. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Watermarking embeds statistical… 4 r/LocalLLaMA community 1mo ago What are you using to preprocess pdfs before feeding them to a local model? I have been running a local setup for document QA and the output quality varies a lot depending on what the pdf looks like when it hits the LLM. clean prose docs are fine but anything with tables or multi column layouts comes out garbled and the model just works with whatever… 37 Hugging Face Daily Papers research 1mo ago AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering? Abstract Human-AI collaboration in question-answering tasks reveals suboptimal reliance decisions where humans under-rely on correct AI suggestions and over-rely when AI misleads them, with confirmation bias contributing to reduced trust in conflicting AI outputs. Generated by… 25 Hugging Face Daily Papers research 1mo ago Can Predicted Dynamics Exist in the Physical World? Abstract Physical admissibility validation for AI systems uses prediction-control interfaces with kinematic and dynamic conditions to filter invalid proposals while maintaining high performance. AI-generated summary Predictive Physical AI systems output state rollouts, action… 33 r/LocalLLaMA community 1mo ago nvidia-LocateAnything-3B detects sushi as sweet in the video demo https://preview.redd.it/xc0l68bj7t4h1.png?width=616&format=png&auto=webp&s=48a8b14bc4ae95700cd4efa76772f4e71fb2d41a https://huggingface.co/nvidia/LocateAnything-3B funny how they left this in the demo atleast it's honest   submitted by   /u/chocofoxy [link]  … 7 r/LocalLLaMA community 1mo ago NVIDIA releases Cosmos 3 Omnimodal world modelson HF https://huggingface.co/nvidia/Cosmos3-Super-Text2Image Nano: 16B Super: 64B Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory… 7 r/LocalLLaMA community 1mo ago Moss tts 1.5 8b Examples. It is the currently best voice cloning model for English as of June 2026 Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts voice clone tts. You can easily get more better quality if you set up the duration of the voice in output you want and some temperature and other changes. This was just used on default setting. It can be improved… 20 arXiv — Machine Learning research 1mo ago BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding arXiv:2606.00144v1 Announce Type: new Abstract: Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU… 33 arXiv — Machine Learning research 1mo ago The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition arXiv:2606.00545v1 Announce Type: new Abstract: Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citep{jack2026twomodes} we showed they can also recognize when they are currently acting on-policy, through… 31 arXiv — Machine Learning research 1mo ago Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models arXiv:2606.00566v1 Announce Type: new Abstract: As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model… 14 arXiv — NLP / Computation & Language research 1mo ago Linguistics-Aware Non-Distortionary LLM Watermarking arXiv:2606.00613v1 Announce Type: new Abstract: Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where… 37 arXiv — NLP / Computation & Language research 1mo ago EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models arXiv:2606.00722v1 Announce Type: new Abstract: Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have… 19 arXiv — NLP / Computation & Language research 1mo ago HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering arXiv:2606.00971v1 Announce Type: new Abstract: Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability… 23 arXiv — NLP / Computation & Language research 1mo ago A Finite-Calibration Regime Map for LLM Judge Panels arXiv:2606.01034v1 Announce Type: new Abstract: We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas… 24 arXiv — NLP / Computation & Language research 1mo ago When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression arXiv:2606.01074v1 Announce Type: new Abstract: Recent high-performing text embedding models often output high-dimensional real-valued vectors, resulting in substantial storage and computational costs. To address this issue, compression methods based on dimensionality reduction… 18 Latent.Space news-outlet 1mo ago [AINews] NVIDIA Cosmos 3, Nemotron 3 Ultra, and RTX Spark Jensen scores a huge win. 27 Hugging Face Daily Papers research 1mo ago Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs Abstract Automated systems for generating scientific figures face limitations in handling diverse figure types and conditions, prompting the development of multi-agent frameworks that generalize across different input scenarios and produce editable output formats. AI-generated… 29 NVIDIA Developer Blog official-blog 1mo ago Deploy Agentic-Ready AI at the Edge with Memory Efficiency in NVIDIA JetPack 7.2 As AI agents move from the digital world to the physical environment, they can readily use NVIDIA Jetson to accelerate real-world deployment with optimized... 24 NVIDIA Developer Blog official-blog 1mo ago Run Local AI Agents with Faster Models and Multi-Node Clustering on NVIDIA DGX Spark The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent... 16 TechCrunch — AI news-outlet 1mo ago Nvidia chases $200B CPU market with AI agent PCs from Microsoft, Dell, and HP If Nvidia has cracked a way to bring AI agents easily, safely and usefully to the masses, it could — and should — be big. 38 r/LocalLLaMA community 1mo ago ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED Them boys can cook, one big fix after another! If you're running --sm tensor on multi-gpu this is the KV cache quantization fix https://github.com/ggml-org/llama.cpp/releases/tag/b9455 JohannesGaessler commented 5 days ago This PR implements support for the combination of -sm… 11 llama.cpp releases dev-tools 1mo ago b9464 speculative : fix n_outputs_max and remove draft-simple auto-enable ( #23988 ) speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in… 7 r/LocalLLaMA community 1mo ago NVIDIA GB300 Grace Blackwell Ultra pricetags https://www.scan.co.uk/shop/ai-and-robotics/workstations-ai/nvidia-dgx-station   submitted by   /u/X-N2O [link]   [comments] 5 llama.cpp releases dev-tools 1mo ago b9460 llama: limit max outputs of llama_context ( #23861 ) llama: save more VRAM by reserving n_outputs == n_seqs when possible add n_outputs_per_seq move n_outputs_max to server-context change ubatch to batch everywhere macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon… 15 r/LocalLLaMA community 1mo ago Computex 2026: Intel launches Crescent Island GPU with up to 480GB VRAM https://www.neowin.net/news/computex-2026-intel-launches-crescent-island-gpu-with-up-to-480gb-vram/ Crescent Island is based on the company"s Arc Xe 3P architecture which lies inside current Panther Lake iGPs as well. This is Intel"s latest, most powerful card and it packs up to… 36 Ollama releases dev-tools 1mo ago v0.30.0-rc32: llama-server followups (#16353) llama-server followups Misc fixes for #16031 Add back dropped ROCm build flag for multi-GPU support on windows Fix amdhip64_*.dll version detection for "latest" selection Fix embeddings API for consistent normalize behavior with prior versions ci: set up for automated llama.cpp… 19 The Information — AI news-outlet 1mo ago OpenAI Could Release Internal Tool That Would Weaken Nvidia’s Software Advantage Morning! Anissa here. OpenAI is open to the idea of publicly sharing software it’s been developing to make its AI run on chips from different providers, a senior executive said, a move that could weaken one of Nvidia’s biggest advantages. The revealing comments came from Sachin… 24 r/LocalLLaMA community 1mo ago llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp Overview continue #23764 , this PR only reserves logits space for n_seqs when possible. With -ub 2048 and MTP, this saves another 1.2GB of VRAM for me. I've tested llama-perplexity also and it seems to work fine. But maybe there is a better API, putting up as a draft for now… 22 llama.cpp releases dev-tools 1mo ago b9452 vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints ( #23056 ) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from… 4 r/LocalLLaMA community 1mo ago mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:… 24 r/MachineLearning community 1mo ago 5060 Ti 16GB or Cloud: Which makes more sense for DL, RL, and LLM studies/research? [D] Hi everyone, If you have purchased (at least one) GPU(s) for ML/DL studies and research: How is your experience and is it worth it? What do you use it for and how is the ROI? I have a MacBook Pro with M4 from some years ago, while MPS is useful in many occasions, it's no… 29 Page 10 of 10 · 500 articles ← Newer