Tag

Agents + tool use

500 articles archived under #agents · RSS

llama.cpp releases dev-tools 15d ago

b9691

ggml-cpu: Conditionally enable power11 backend based on compiler support (#24687). Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang…

14
r/LocalLLaMA community 15d ago

Lemonade v10.8: auto memory management, cloud offload, Omni improvements, and call your local models as MCP tools

v10.8 is out, so here's a project update on what landed. This was a 20-contributor release in just 7 days! Smarter memory and context management Dynamic VRAM management now auto-unloads idle models and downsizes their KV-cache to reclaim GPU memory on the fly, plus model pinning…

27
Ars Technica — AI news-outlet 15d ago

AI coding agents taught robots how to install GPUs and cut zip-ties

NVIDIA’s self-improvement program for robots enlists teams of AI coding agents.

13
TechCrunch — AI news-outlet 15d ago

NEA’s Tiffany Luck on AI IPOs, personal agents, and the ROI reckoning

Tokenmaxxing was the hottest trend in Silicon Valley earlier this year, with CEOs encouraging employees to push AI usage as far as it would go. Then the bill came due. Uber reportedly blew through its annual AI budget in a few months, some companies…

23
r/LocalLLaMA community 15d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

arXiv : https://arxiv.org/abs/2606.17861 Full Paper : https://arxiv.org/pdf/2606.17861 HuggingFace : https://huggingface.co/papers/2606.17861 GitHub : https://github.com/tongxuluo/gamecraft-bench Project : https://tongxuluo.github.io/gamecraft-bench-website/ I see big/large…

20
llama.cpp releases dev-tools 15d ago

b9685

[SYCL] add dev2dev memcpy by SYCL API (#24476). Add dev2dev memcpy by SYCL API; move GGML_SYCL_DEV2DEV_MEMCPY to the runtime table; update the detection method for p2p comm; fix the error created while resolving a conflict. Co-authored-by: Neo Zhang. macOS/iOS: macOS Apple Silicon (arm64) macOS…

33
Vercel — AI dev-tools 15d ago

Vercel Ship 2026 recap

For a decade, Vercel has shaped how the web gets built. Now, we’re doing the same for agents. The companies that win the next decade will build on infrastructure designed for agents from the start, and over 2,500 people gathered in London this week to do just that at Vercel Ship…

20
r/LocalLLaMA community 15d ago

GLM-5.2 is a win for local AI

I know GLM 5.2's massive 753B footprint means none of us are running it at home without an enterprise cluster, but having a true frontier-level, MIT-licensed coding agent out in the wild makes me optimistic. The distillation potential here is massive. Once the community starts…

38
r/LocalLLaMA community 15d ago

Headless screenshot loops let a local 30B agent finish a raytraced FPS demo in pure C

Some background so this is honest. Over the past few months I ran a lot of oneshot experiments with single file three.js games. Minecraft clones, that kind of thing. I picked those on purpose because they sit deep in the training data and are trivial to debug by eye. The goal…

37
Hugging Face Daily Papers research 15d ago

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Abstract: The DR-DCI framework combines retrieval with direct corpus interaction by dynamically pulling relevant documents into a local workspace, enabling scalable and efficient agentic search across large corpora. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Agentic search over…

27
Hugging Face Daily Papers research 15d ago

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Abstract: Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multimodal large language models (MLLMs) have demonstrated…

25
llama.cpp releases dev-tools 15d ago

b9674

SYCL: fix use-after-free bug with async memcpy in MoE prefill (#24676). SYCL: fix a bug with async memcpy; make mmid_row_mapping_host persistent; comment on stream->wait; apply suggestions from @sanmai. macOS/iOS: macOS…

34
Hugging Face official-blog 15d ago

From the Hugging Face Hub to robot hardware with Strands Agents and LeRobot

Enterprise article, published June 17, 2026, by Sundar Raghavan and Cagatay Cali (Amazon). A walkthrough of the LeRobot integration in Strands…

28
arXiv — Machine Learning research 15d ago

ProCUA-SFT Technical Report

arXiv:2606.17321v1 Announce Type: new Abstract: Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest…

9
arXiv — Machine Learning research 15d ago

Offline Preference-Based Trajectory Evaluation

arXiv:2606.17541v1 Announce Type: new Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective…

20
arXiv — NLP / Computation & Language research 15d ago

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

arXiv:2606.17680v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards.…

15
arXiv — NLP / Computation & Language research 15d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

arXiv:2606.17162v1 Announce Type: new Abstract: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn…

25
arXiv — NLP / Computation & Language research 15d ago

PromptMN: Pseudo Prompting Language

arXiv:2606.17164v1 Announce Type: new Abstract: Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic…

13
arXiv — NLP / Computation & Language research 15d ago

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

arXiv:2606.17519v1 Announce Type: new Abstract: Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed…

14
arXiv — NLP / Computation & Language research 15d ago

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

arXiv:2606.17628v1 Announce Type: new Abstract: Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate…

37
arXiv — NLP / Computation & Language research 15d ago

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

arXiv:2606.17838v1 Announce Type: new Abstract: LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes…

20
arXiv — NLP / Computation & Language research 15d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

arXiv:2606.17861v1 Announce Type: new Abstract: Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a…

28
arXiv — NLP / Computation & Language research 15d ago

Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

arXiv:2606.18051v1 Announce Type: new Abstract: LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem:…

8
arXiv — NLP / Computation & Language research 15d ago

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

arXiv:2606.18203v1 Announce Type: new Abstract: The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an…

28
arXiv — NLP / Computation & Language research 15d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale…

36
arXiv — NLP / Computation & Language research 15d ago

Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

arXiv:2606.17092v1 Announce Type: cross Abstract: Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a…

8
arXiv — NLP / Computation & Language research 15d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

arXiv:2606.17389v1 Announce Type: cross Abstract: Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that…

24
arXiv — NLP / Computation & Language research 15d ago

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with…

17
arXiv — NLP / Computation & Language research 15d ago

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

arXiv:2606.17645v1 Announce Type: cross Abstract: Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow…

32
arXiv — NLP / Computation & Language research 15d ago

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.…

24
arXiv — NLP / Computation & Language research 15d ago

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically…

33
arXiv — NLP / Computation & Language research 15d ago

A Framework for Evaluating Agentic Skills at Scale

arXiv:2606.17819v1 Announce Type: cross Abstract: Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain…

10
arXiv — NLP / Computation & Language research 15d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

arXiv:2606.18037v1 Announce Type: cross Abstract: Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually…

27
arXiv — NLP / Computation & Language research 15d ago

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

arXiv:2606.18060v1 Announce Type: cross Abstract: As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that…

13
arXiv — NLP / Computation & Language research 15d ago

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,…

21
arXiv — NLP / Computation & Language research 15d ago

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

arXiv:2601.03872v2 Announce Type: replace Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool…

27
arXiv — NLP / Computation & Language research 15d ago

LVLMs and Humans Ground Differently in Referential Communication

arXiv:2601.19792v4 Announce Type: replace Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common…

9
Vercel — AI dev-tools 15d ago

Introducing Vercel Connect

Giving your agents access to your tools, data, and services is what makes them useful. As agents perform deeper work across systems, authenticating and authorizing that access becomes central to your application architecture. Today, agent access is usually granted through…

21
Vercel — AI dev-tools 15d ago

Introducing eve

Today, we are proud to introduce eve, an open-source agent framework for building, running, and scaling agents. eve is designed around the idea that building an agent should mean defining what it does without assembling all of the pieces that it needs to run in production.…

15
Hugging Face Daily Papers research 15d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Abstract MemSlides presents a hierarchical memory framework for personalized presentation agents that separates long-term user profiles, working memory for session constraints, and tool memory for reusable execution experiences to enable stable personalization and reliable local…

21
Hugging Face Daily Papers research 15d ago

ProCUA-SFT Technical Report

Abstract: Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Training computer-use agents…

4
Hugging Face Daily Papers research 15d ago

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Abstract: OPD-Evolver is a self-evolving agent framework that combines slow-fast co-evolution with on-policy self-distillation to enhance memory management and policy learning across multiple domains. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Memory has become a standard…

28
Hugging Face Daily Papers research 15d ago

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Abstract: Research agents face significant challenges when evidence is in a different language than the query, with performance degrading even when gold evidence is provided directly. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Deep research agents are increasingly evaluated on…

28
Hugging Face Daily Papers research 15d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive…

31
Hugging Face Daily Papers research 15d ago

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Abstract LectūraAgents is a multi-agent framework that enables personalized learning through adaptive embodied teaching by mimicking professor-student interactions and generating coordinated teaching actions aligned with learner profiles. Generated by…

9
Vercel — AI dev-tools 15d ago

Introducing eve, an open-source agent framework

eve is now available in public preview. eve is an open-source framework for building, running, and scaling agents. An agent is just a directory of files, and production comes built in: Durable execution Sandboxed compute Human-in-the-loop approvals Subagents Evals The smallest…

31
Hugging Face official-blog 15d ago

Agentic Resource Discovery: Let agents search

Agentic Resource Discovery: Let agents search for tools, skills, and other agents. Published June 17, 2026, by Ben Burtenshaw and Shaun Smith. If you build with agents today, you probably know three protocols.…

15
Vercel — AI dev-tools 15d ago

CLI deployment limits removed

We've removed CLI-specific deployment limits, making it easier to deploy from local machines and external CI/CD pipelines with instant feedback. Teams and AI agents can now deploy at the pace their workflows demand. Learn more about limits in the documentation.

5
Vercel — AI dev-tools 15d ago

Vercel for Enterprise Apps and Agents

Today we are introducing Vercel for Enterprise Apps and Agents, a platform that gives your entire company the ability to ship with AI safely, behind your access and security boundaries. Over the past year, employees across Vercel shipped hundreds of agents and internal apps.…

34
NVIDIA Developer Blog official-blog 15d ago

Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI

Developers building for AR glasses and wearable devices face an infrastructure gap. The hardware is ready, but creating AI experiences requires integrating live...

33
