Tag

GPU

500 articles archived under #gpu · RSS

arXiv — Machine Learning research 29d ago

MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

arXiv:2606.03014v1 Announce Type: new Abstract: Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Efficiently executing this workload on limited GPU resources has bottlenecks. Skill-based…

22
arXiv — Machine Learning research 29d ago

Will Accurate Fields Mislead Photonic Design? From Global Accuracy to Port Readout

arXiv:2606.03038v1 Announce Type: new Abstract: Neural field surrogates can accelerate photonic design loops, but a surrogate that looks accurate in global field error can still mis-rank candidate devices when the final decision depends on localized output-port readouts. This…

23
arXiv — NLP / Computation & Language research 29d ago

Greener Than Humans? Environmental Attitudes in Large Language Models

arXiv:2606.02741v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs.…

19
arXiv — NLP / Computation & Language research 29d ago

Chatbots Output Meaningful (but Problematic) Language

arXiv:2606.02973v1 Announce Type: new Abstract: Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning…

28
arXiv — NLP / Computation & Language research 29d ago

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

arXiv:2606.02981v1 Announce Type: new Abstract: Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance…

30
arXiv — NLP / Computation & Language research 29d ago

The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

arXiv:2606.03357v1 Announce Type: new Abstract: When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that…

18
arXiv — NLP / Computation & Language research 29d ago

Consistency Training Can Entrench Misalignment

arXiv:2606.03810v1 Announce Type: new Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly…

31
Hugging Face Daily Papers research 29d ago

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

Abstract OmniDreams, a foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As…

23
r/LocalLLaMA community 29d ago

Why do we benchmark quants on perplexity and prose but never on tool call validity?

The mixed precision quant discussion here lately, MoE aware stuff that keeps shared experts and the edge layers at higher precision is great, but it's almost all measured against perplexity and general output quality. What I never see is structured output. Tool call JSON,…

37
Hacker News — AI on Front Page community 29d ago

Use your Nvidia GPU's VRAM as swap space on Linux

Article URL: https://github.com/c0dejedi/nbd-vram Comments URL: https://news.ycombinator.com/item?id=48377404 Points: 225 # Comments: 65

18
r/LocalLLaMA community 1mo ago

Weird issue with OpenCode and Qwen3.6

I’m using Qwen3.6-27B running on my server with llama-server for AI coding with OpenCode. Sometimes for some reason, the response stops when its reasoning like if it has finished outputting the full response. I have to type “continue” and it continues working like if nothing…

30
llama.cpp releases dev-tools 1mo ago

b9483

hexagon: profiler output fix and script updates ( #24042 ) hex-ops: fix profiler output (ie remove the redundant NONEs) hex-prof: update profiling script to support tot.usec column macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED…

26
NVIDIA Developer Blog official-blog 1mo ago

Build Personal AI Agents on Windows PCs with New Tools from Microsoft and NVIDIA

AI agents are changing how you interact with your PC. Creators, developers, and AI enthusiasts are already using these agents extensively to assist with...

17
r/LocalLLaMA community 1mo ago

Would you consider getting an NVIDIA RTX Spark laptop?

If yes, why? If no, also say why. I’d consider one if it’s faster at local AI inference than my current hardware and can still handle gaming decently. The 128GB unified memory idea is pretty interesting, but I’m unsure about Windows on Arm and game compatibility. Would you buy…

27
The Information — AI news-outlet 1mo ago

Microsoft Debuts New Nvidia-Powered “Dev Box” PC; OS for Wearable and Desktop “Agent First” Devices

Microsoft on Tuesday unveiled a new desktop PC powered by an Nvidia processor geared towards developers who want to run AI models locally. The new PC, called the Surface RTX Spark Dev Box, is powered by Nvidia’s new Arm-based chips announced earlier this week. Microsoft…

14
r/LocalLLaMA community 1mo ago

Why don't we still have any games with AI agents used as NPC characters?

Do you remember this NVIDIA AI-NPC presentation from 3 years ago? https://www.youtube.com/watch?v=5R8xZb6J3r0 Where all of that? Why do we even try getting agents to do all the work if they still cannot be reliably used as a characters in the video games? Isn't it should be the…

11
r/LocalLLaMA community 1mo ago

I Put a Datacenter GPU in My Gaming PC for £200

Hey there! I wrote a blogpost about my experience running local models on a V100 from a newbie perspective and got loads of views outside of reddit, so I thought I'd share it here too! submitted by /u/tymscar

33
r/LocalLLaMA community 1mo ago

Benchmarks of 20 small LLMs on a 6GB RTX 4050

I'm looking for models that can run on my GPU and actually do something useful. I think that any small difference could be a "big" improvement, because they are all so small. So I went to the LM studio database and searched many variants from the same family, trying to select…

37
NVIDIA Developer Blog official-blog 1mo ago

Deploy Self-Evolving Agents for Faster, More Secure Research with a Hermes Agent and NVIDIA NemoClaw

AI agents are a powerful tool for synthesizing data to accelerate research, summarize information, and help teams make decisions faster. But combining internal...

29
Hugging Face Daily Papers research 1mo ago

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Abstract Watermarking AI-generated text for detection fails when multiple models are used, as averaging outputs cancels perturbations and suppresses detection while improving quality and speed. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Watermarking embeds statistical…

4
r/LocalLLaMA community 1mo ago

What are you using to preprocess pdfs before feeding them to a local model?

I have been running a local setup for document QA and the output quality varies a lot depending on what the pdf looks like when it hits the LLM. clean prose docs are fine but anything with tables or multi column layouts comes out garbled and the model just works with whatever…

37
Hugging Face Daily Papers research 1mo ago

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Abstract Human-AI collaboration in question-answering tasks reveals suboptimal reliance decisions where humans under-rely on correct AI suggestions and over-rely when AI misleads them, with confirmation bias contributing to reduced trust in conflicting AI outputs. Generated by…

25
Hugging Face Daily Papers research 1mo ago

Can Predicted Dynamics Exist in the Physical World?

Abstract Physical admissibility validation for AI systems uses prediction-control interfaces with kinematic and dynamic conditions to filter invalid proposals while maintaining high performance. AI-generated summary Predictive Physical AI systems output state rollouts, action…

33
r/LocalLLaMA community 1mo ago

nvidia-LocateAnything-3B detects sushi as sweet in the video demo

https://preview.redd.it/xc0l68bj7t4h1.png?width=616&format=png&auto=webp&s=48a8b14bc4ae95700cd4efa76772f4e71fb2d41a https://huggingface.co/nvidia/LocateAnything-3B funny how they left this in the demo, at least it's honest submitted by /u/chocofoxy …

7
r/LocalLLaMA community 1mo ago

NVIDIA releases Cosmos 3 Omnimodal world models on HF

https://huggingface.co/nvidia/Cosmos3-Super-Text2Image Nano: 16B Super: 64B Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory…

7
r/LocalLLaMA community 1mo ago

Moss tts 1.5 8b Examples. It is the currently best voice cloning model for English as of June 2026

Moss tts 1.5 8b is better than fish audio s2 pro and qwen 3 tts voice clone tts. You can easily get more better quality if you set up the duration of the voice in output you want and some temperature and other changes. This was just used on default setting. It can be improved…

20
arXiv — Machine Learning research 1mo ago

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

arXiv:2606.00144v1 Announce Type: new Abstract: Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU…

33
arXiv — Machine Learning research 1mo ago

The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition

arXiv:2606.00545v1 Announce Type: new Abstract: Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citep{jack2026twomodes} we showed they can also recognize when they are currently acting on-policy, through…

31
arXiv — Machine Learning research 1mo ago

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

arXiv:2606.00566v1 Announce Type: new Abstract: As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model…

14
arXiv — NLP / Computation & Language research 1mo ago

Linguistics-Aware Non-Distortionary LLM Watermarking

arXiv:2606.00613v1 Announce Type: new Abstract: Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where…

37
arXiv — NLP / Computation & Language research 1mo ago

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

arXiv:2606.00722v1 Announce Type: new Abstract: Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have…

19
arXiv — NLP / Computation & Language research 1mo ago

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

arXiv:2606.00971v1 Announce Type: new Abstract: Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability…

23
arXiv — NLP / Computation & Language research 1mo ago

A Finite-Calibration Regime Map for LLM Judge Panels

arXiv:2606.01034v1 Announce Type: new Abstract: We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas…

24
arXiv — NLP / Computation & Language research 1mo ago

When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression

arXiv:2606.01074v1 Announce Type: new Abstract: Recent high-performing text embedding models often output high-dimensional real-valued vectors, resulting in substantial storage and computational costs. To address this issue, compression methods based on dimensionality reduction…

18
Latent.Space news-outlet 1mo ago

[AINews] NVIDIA Cosmos 3, Nemotron 3 Ultra, and RTX Spark

Jensen scores a huge win.

27
Hugging Face Daily Papers research 1mo ago

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Abstract Automated systems for generating scientific figures face limitations in handling diverse figure types and conditions, prompting the development of multi-agent frameworks that generalize across different input scenarios and produce editable output formats. AI-generated…

29
NVIDIA Developer Blog official-blog 1mo ago

Deploy Agentic-Ready AI at the Edge with Memory Efficiency in NVIDIA JetPack 7.2

As AI agents move from the digital world to the physical environment, they can readily use NVIDIA Jetson to accelerate real-world deployment with optimized...

24
NVIDIA Developer Blog official-blog 1mo ago

Run Local AI Agents with Faster Models and Multi-Node Clustering on NVIDIA DGX Spark

The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent...

16
TechCrunch — AI news-outlet 1mo ago

Nvidia chases $200B CPU market with AI agent PCs from Microsoft, Dell, and HP

If Nvidia has cracked a way to bring AI agents easily, safely and usefully to the masses, it could — and should — be big.

38
r/LocalLLaMA community 1mo ago

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

Them boys can cook, one big fix after another! If you're running --sm tensor on multi-gpu this is the KV cache quantization fix https://github.com/ggml-org/llama.cpp/releases/tag/b9455 JohannesGaessler commented 5 days ago This PR implements support for the combination of -sm…

11
llama.cpp releases dev-tools 1mo ago

b9464

speculative : fix n_outputs_max and remove draft-simple auto-enable ( #23988 ) speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in…

7
r/LocalLLaMA community 1mo ago

NVIDIA GB300 Grace Blackwell Ultra pricetags

https://www.scan.co.uk/shop/ai-and-robotics/workstations-ai/nvidia-dgx-station submitted by /u/X-N2O

5
llama.cpp releases dev-tools 1mo ago

b9460

llama: limit max outputs of llama_context ( #23861 ) llama: save more VRAM by reserving n_outputs == n_seqs when possible add n_outputs_per_seq move n_outputs_max to server-context change ubatch to batch everywhere macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…

15
r/LocalLLaMA community 1mo ago

Computex 2026: Intel launches Crescent Island GPU with up to 480GB VRAM

https://www.neowin.net/news/computex-2026-intel-launches-crescent-island-gpu-with-up-to-480gb-vram/ Crescent Island is based on the company's Arc Xe 3P architecture which lies inside current Panther Lake iGPs as well. This is Intel's latest, most powerful card and it packs up to…

36
Ollama releases dev-tools 1mo ago

v0.30.0-rc32: llama-server followups (#16353)

llama-server followups Misc fixes for #16031 Add back dropped ROCm build flag for multi-GPU support on windows Fix amdhip64_*.dll version detection for "latest" selection Fix embeddings API for consistent normalize behavior with prior versions ci: set up for automated llama.cpp…

19
The Information — AI news-outlet 1mo ago

OpenAI Could Release Internal Tool That Would Weaken Nvidia’s Software Advantage

Morning! Anissa here. OpenAI is open to the idea of publicly sharing software it’s been developing to make its AI run on chips from different providers, a senior executive said, a move that could weaken one of Nvidia’s biggest advantages. The revealing comments came from Sachin…

24
r/LocalLLaMA community 1mo ago

llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

Overview continue #23764 , this PR only reserves logits space for n_seqs when possible. With -ub 2048 and MTP, this saves another 1.2GB of VRAM for me. I've tested llama-perplexity also and it seems to work fine. But maybe there is a better API, putting up as a draft for now…

22
llama.cpp releases dev-tools 1mo ago

b9452

vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints ( #23056 ) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from…

4
r/LocalLLaMA community 1mo ago

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:…

24
r/MachineLearning community 1mo ago

5060 Ti 16GB or Cloud: Which makes more sense for DL, RL, and LLM studies/research? [D]

Hi everyone, If you have purchased (at least one) GPU(s) for ML/DL studies and research: How is your experience and is it worth it? What do you use it for and how is the ROI? I have a MacBook Pro with M4 from some years ago, while MPS is useful in many occasions, it's no…

29
