Tag

Agents + tool use

500 articles archived under #agents · RSS

arXiv — Machine Learning research 2d ago

A Linear Matching Bandit Approach to Online Multi-Human Multi-Robot Teaming

arXiv:2606.29221v1 Announce Type: new Abstract: We address the problem of online multi-human multi-robot teaming through the lens of a linear matching bandit framework, where a learner assigns robots with unknown features from a fixed pool to distinct sets of human agents over…

15
arXiv — Machine Learning research 2d ago

Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning

arXiv:2606.29280v1 Announce Type: new Abstract: We identify intervention bias as a previously unquantified failure mode of zero-shot large-language-model (LLM) educational advisory agents: without task-specific training, they recommend action when a hindsight-optimal oracle…

31
arXiv — Machine Learning research 2d ago

Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents

arXiv:2606.29459v1 Announce Type: new Abstract: Inverse design of metal-organic frameworks (MOFs) requires searching a combinatorially vast space where property labels are expensive and most machine-learning models reveal little about why a structure succeeds. We introduce…

8
arXiv — Machine Learning research 2d ago

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

arXiv:2606.29476v1 Announce Type: new Abstract: Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a…

24
arXiv — NLP / Computation & Language research 2d ago

Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models

arXiv:2606.28524v1 Announce Type: new Abstract: Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measured by the false belief task (FBT), yet persistent concerns of construct validity remain. We adopt a…

25
arXiv — NLP / Computation & Language research 2d ago

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce…

28
arXiv — NLP / Computation & Language research 2d ago

MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling

arXiv:2606.29265v1 Announce Type: new Abstract: Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents,…

17
arXiv — NLP / Computation & Language research 2d ago

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to…

33
arXiv — NLP / Computation & Language research 2d ago

SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution

arXiv:2606.29713v1 Announce Type: new Abstract: Hallucination is the reliability bottleneck for LLM-based agents, and fact attribution verifiers are the last line of defense -- yet today's verifiers emit only opaque binary labels, leaving agents unable to self-correct and…

24
arXiv — NLP / Computation & Language research 2d ago

Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

arXiv:2606.29824v1 Announce Type: new Abstract: While Large Language Models (LLMs) excel as static solvers, transforming them into autonomous agents remains challenging. This transition requires continuous environmental interaction, yet current agents lack the necessary…

17
arXiv — NLP / Computation & Language research 2d ago

KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search

arXiv:2606.29863v1 Announce Type: new Abstract: Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration -- deciding when to trust parametric…

38
arXiv — NLP / Computation & Language research 2d ago

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it…

4
arXiv — NLP / Computation & Language research 2d ago

Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is…

17
arXiv — NLP / Computation & Language research 2d ago

LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard

arXiv:2606.30005v1 Announce Type: new Abstract: Long-horizon tool agents are bottlenecked by how their context grows toward the limits of the context window. Recent systems make context management agent- or system-controlled, but they either learn a compression policy that…

34
Hugging Face Daily Papers research 2d ago

TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Abstract Tool-Augmented Credit Optimization (TACO) improves multimodal agent performance by distinguishing useful, redundant, or misleading code operations through dual advantage channels: Differential Answer-Probe Reward for individual tool contribution and Outcome-Gated…

38
Hugging Face Daily Papers research 2d ago

Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Abstract Agents-A1, a 35B Mixture-of-Experts Agentic Model, achieves trillion-parameter-level performance through long-horizon trajectory scaling and heterogeneous agent ability scaling via a three-stage training approach involving supervised fine-tuning, domain-level teacher…

28
Hugging Face Daily Papers research 2d ago

PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents

Abstract POLICYGUARD is a sub-agent verifier that enhances LLM agent policy adherence by providing contextual reasoning and conversation-specific feedback across multi-turn interactions. LLM agents handle user requests on behalf of…

11
Hugging Face Daily Papers research 2d ago

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

Abstract GUICrafter addresses GUI agent data challenges through a weakly-supervised approach using unannotated screenshots and a two-stage curriculum learning framework for visual grounding and reinforcement learning calibration. …

10
Hugging Face Daily Papers research 2d ago

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Abstract A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks. …

28
Vercel — AI dev-tools 2d ago

Claude Sonnet 5 now available on Vercel AI Gateway

Claude Sonnet 5 from Anthropic is now available on AI Gateway. Sonnet 5 improves on Sonnet 4.6 across coding and agentic work, reaching outcomes on many tasks that previously needed an Opus model, at Sonnet pricing. The model is more agentic and follows instructions more…

14
Vercel — AI dev-tools 2d ago

Vercel Private Blob is now generally available

Vercel Private Blob is now generally available for all plans. Store sensitive files like user-uploaded photos, invoices, and agent memory, and control exactly who can read them. Private stores, Signed URLs, and OIDC authentication all graduate from beta with this release. Vercel…

22
Vercel — AI dev-tools 2d ago

An expanded Vercel Agent: chat, investigations, and approved actions, now in public beta

Today, we're launching expanded capabilities for Vercel Agent in public beta. Vercel Agent now lives in your dashboard and can investigate production issues, answer questions about your projects, and take action on your behalf. Because Agent runs inside the platform that deploys…

27
Vercel — AI dev-tools 2d ago

Vercel Agent has updated pricing

Vercel Agent pricing is changing. You no longer need to pre-load credits or manage a separate wallet for Vercel Agent. Instead of a $0.30 per-request fee, you now pay a Vercel Token Rate of $0.25 per million tokens, plus provider inference costs at the underlying token rate. The…

6
MIT Technology Review — AI news-outlet 2d ago

AI agents are not your “coworkers”

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. Imagine coming in to work to learn that a new underling will report to you. The worker is not a person but an AI tool—one that your company…

6
Hacker News — AI on Front Page community 2d ago

Ornith-1.0: self-improving open-source models for agentic coding

Article URL: https://github.com/deepreinforce-ai/Ornith-1 Comments URL: https://news.ycombinator.com/item?id=48722052 Points: 215 # Comments: 39

31
TechCrunch — AI news-outlet 2d ago

Cursor now has a mobile app for guiding your coding agent on the go

Cursor has launched a new mobile app for remote oversight over coding agents.

29
Simon Willison community 2d ago

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding This is an interesting new open weights (MIT licensed) model, the first model release from DeepReinforce. [...] with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen…

5
NVIDIA Developer Blog official-blog 2d ago

How to Govern Autonomous Agents in Enterprise AI Factories

AI agents are quickly moving beyond chat. They inspect code, run tests, read documents, search knowledge bases, query internal systems, and operate for hours on...

28
MIT Technology Review — AI news-outlet 2d ago

Agent confidence on the technical frontier

Enterprise investment in AI is booming. Gartner is calling 2026 an “inflection year” for organizations to align their AI projects with strategic business objectives. As the pressure to prove ROI mounts, executives and technology leaders are looking to agentic AI to drive the…

19
r/LocalLLaMA community 2d ago

Anyone else end up building a web access layer for local AI agents?

I've been running local models for most of my experiments, and I kept running into the same issue. The model lives locally, but everything it needs to interact with doesn't. Every new agent ended up with another GitHub client, another Reddit integration, another documentation…

10
r/MachineLearning community 3d ago

Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R]

Google deployed an agentic AI peer-reviewer at two top CS conferences — reviewing ~10,000 papers with 30-minute turnaround — and the new formal research paper shows it catches 34% more mathematical errors than zero-shot prompting; the precedent for AI-automated scientific review…

23
Vercel — AI dev-tools 3d ago

Build realtime voice agents on AI Gateway

AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI. Each…

26
Smol AI News news-outlet 3d ago

not much happened today

**Meta** announced **Brain2Qwerty v2**, a real-time non-invasive brain-to-text decoder achieving up to **78% word accuracy** with released training code and dataset. **Cursor** launched **Cursor for iOS** with remote AI agents and live activity features. Open-weight model access…

35
arXiv — Machine Learning research 3d ago

Training Observable Control Policies to Expose Agent State Through Actions

arXiv:2606.27609v1 Announce Type: new Abstract: Physical or operational constraints often impose communications limitations on autonomous agents. Such limitations complicate monitoring or multiagent coordination. Even when strong communications are absent, some information may…

13
arXiv — Machine Learning research 3d ago

COOPA: A Modular LLM Agent Architecture for Operations Research Problems

arXiv:2606.27611v1 Announce Type: new Abstract: Operations Research (OR) provides a rigorous framework for high-stakes decision-making, but effective OR modeling requires substantial domain knowledge, mathematical abstraction, and solver expertise. Recent LLM-based systems…

18
arXiv — Machine Learning research 3d ago

CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association

arXiv:2606.28179v1 Announce Type: new Abstract: Identifying robust associations between cardiac imaging phenotypes and clinical diseases is fundamental to population-scale cardiovascular research and reliable risk stratification. However, current phenome-wide association studies…

13
arXiv — Machine Learning research 3d ago

LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior

arXiv:2606.28182v1 Announce Type: new Abstract: Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)-based agents often exhibit behaviors that are…

28
arXiv — Machine Learning research 3d ago

Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives

arXiv:2606.28217v1 Announce Type: new Abstract: We propose a framework for reward allocation in fully delegated AI cooperatives where humans are represented by agents that contribute data and participate in model updates under heterogeneous value constraints. The key idea is to…

32
arXiv — NLP / Computation & Language research 3d ago

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

arXiv:2606.27409v1 Announce Type: cross Abstract: Multi-agent large language model (LLM) systems often rely on verifier and critic agents to suppress hallucinations, but verification is delayed. During this delay, false claims can propagate through the agent network. We model…

25
arXiv — NLP / Computation & Language research 3d ago

Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

arXiv:2606.27472v1 Announce Type: new Abstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding…

16
arXiv — NLP / Computation & Language research 3d ago

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

arXiv:2606.27595v1 Announce Type: new Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,…

32
arXiv — NLP / Computation & Language research 3d ago

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume…

27
arXiv — NLP / Computation & Language research 3d ago

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write…

11
arXiv — NLP / Computation & Language research 3d ago

AI Persuasive Framing in Collective Dilemmas

arXiv:2606.27951v1 Announce Type: cross Abstract: AI agents are promising tools that can act as flexible behavioral nudges to enhance human cooperation in addressing large-scale societal problems. However, evidence on whether AI agents can effectively boost cooperation remains…

32
arXiv — NLP / Computation & Language research 3d ago

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn…

20
arXiv — NLP / Computation & Language research 3d ago

Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification

arXiv:2511.03217v2 Announce Type: replace Abstract: Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet…

4
arXiv — NLP / Computation & Language research 3d ago

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

arXiv:2604.13072v2 Announce Type: replace Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be…

28
arXiv — NLP / Computation & Language research 3d ago

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from…

34
arXiv — NLP / Computation & Language research 3d ago

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using…

4
r/LocalLLaMA community 3d ago

I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers.

This is something I've been working on. I like playing around with smaller local models but found most agent harnesses are not well suited for them. The failure modes across different model families tend to be the same: failed tool calls, poor verification of environment variables, poor…

12
