Tag

Agents + tool use

500 articles archived under #agents · RSS

arXiv — NLP / Computation & Language research 13d ago

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

arXiv:2606.20002v1 Announce Type: cross Abstract: This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it…

13
arXiv — NLP / Computation & Language research 13d ago

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

arXiv:2606.19659v1 Announce Type: new Abstract: On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on…

17
arXiv — NLP / Computation & Language research 13d ago

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

arXiv:2606.19847v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented…

32
arXiv — NLP / Computation & Language research 13d ago

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

arXiv:2606.19852v1 Announce Type: new Abstract: Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional…

26
arXiv — NLP / Computation & Language research 13d ago

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

arXiv:2606.20113v1 Announce Type: new Abstract: Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the…

21
arXiv — NLP / Computation & Language research 13d ago

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

arXiv:2606.20487v1 Announce Type: new Abstract: Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition…

16
arXiv — NLP / Computation & Language research 13d ago

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

arXiv:2606.19388v1 Announce Type: cross Abstract: Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct…

31
arXiv — NLP / Computation & Language research 13d ago

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

arXiv:2606.19501v1 Announce Type: cross Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing…

14
arXiv — NLP / Computation & Language research 13d ago

Uncertainty Decomposition for Clarification Seeking in LLM Agents

arXiv:2606.19559v1 Announce Type: cross Abstract: Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable…

9
arXiv — NLP / Computation & Language research 13d ago

Benchmarking Agentic Review Systems

arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems…

15
arXiv — NLP / Computation & Language research 13d ago

Multi-Agent Transactive Memory

arXiv:2606.19911v1 Announce Type: cross Abstract: The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated…

24
arXiv — NLP / Computation & Language research 13d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving…

17
arXiv — NLP / Computation & Language research 13d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

arXiv:2606.20529v1 Announce Type: cross Abstract: Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and…

27
arXiv — NLP / Computation & Language research 13d ago

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and…

22
Hugging Face Daily Papers research 13d ago

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

27
r/LocalLLaMA community 13d ago

GLM-5.2 is above GPT-5.5 in AA-Briefcase, Artificial Analysis' new agentic knowledge work eval

submitted by /u/analysis_scaled

7
Hacker News — AI on Front Page community 13d ago

Zero-Touch OAuth for MCP

Article URL: https://blog.modelcontextprotocol.io/posts/enterprise-managed-auth/ Comments URL: https://news.ycombinator.com/item?id=48592163 Points: 202 # Comments: 66

17
Hugging Face official-blog 14d ago

MosaicLeaks: Can your research agent keep a secret?

Published June 18, 2026. Alexander Gurung and Rafael Pardinas (ServiceNow). TL;DR: Deep research agents increasingly combine private local documents…

24
r/LocalLLaMA community 14d ago

poolside/Laguna-M.1 · Hugging Face - 225B-A23B

Laguna M.1 is a 225B total parameter Mixture-of-Experts model with 23B activated parameters per token, designed for agentic coding and long-horizon work. Highlights: Large sparse MoE for agentic coding: Laguna M.1 is a 70-layer MoE transformer with 225B total…

26
TechCrunch — AI news-outlet 14d ago

General Intuition in talks to raise $300M at around $2B valuation

General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning.

14
Hugging Face Daily Papers research 14d ago

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Abstract iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be…

6
Hugging Face Daily Papers research 14d ago

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and…

29
r/LocalLLaMA community 14d ago

gave my local llm agent mcp tools for local image + video gen, so it just generates when i ask (fully offline+free)

free and open source, runs fully offline. the local llm agent does the image and video gen itself via mcp tools. details and github in the comments. submitted by /u/GroundbreakingMall54

33
Hugging Face Daily Papers research 14d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Abstract Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multicultural multi-agent systems…

28
r/LocalLLaMA community 14d ago

Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF · Hugging Face

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. Highlights Outstanding Video Understanding and…

29
Hugging Face Daily Papers research 14d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Abstract RODS addresses sample depletion in multi-turn tool-use reinforcement learning by dynamically synthesizing new data based on reward variance to maintain informative training samples. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-turn tool-use RL is bottlenecked by…

21
r/LocalLLaMA community 14d ago

I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it?

Yes I know this is a simple question I could just ask Claude or something but I want to see what the community suggests For context it’s a 16in MacBook Pro and i use Hermes agent as a harness connected to LM studio as obviously it’s preferable to be running MLX models especially…

4
Vercel — AI dev-tools 14d ago

The Agent Stack

Agents are designed to do almost any kind of work, from answering support tickets to writing code. No matter how complex the workload, how long it runs, or how many turns it takes to complete, every agent needs three core capabilities to operate: Agents need to connect to models…

16
Hugging Face Daily Papers research 14d ago

Native Active Perception as Reasoning for Omni-Modal Understanding

Abstract OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing. Generated by…

24
arXiv — NLP / Computation & Language research 14d ago

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

arXiv:2606.18284v1 Announce Type: cross Abstract: The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve,…

21
arXiv — NLP / Computation & Language research 14d ago

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

arXiv:2606.18388v1 Announce Type: cross Abstract: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to…

34
arXiv — Machine Learning research 14d ago

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,…

18
arXiv — Machine Learning research 14d ago

Stealthy World Model Manipulation via Data Poisoning

arXiv:2606.18697v1 Announce Type: new Abstract: Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack…

18
arXiv — Machine Learning research 14d ago

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

arXiv:2606.18820v1 Announce Type: new Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational…

19
arXiv — NLP / Computation & Language research 14d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory…

22
arXiv — Machine Learning research 14d ago

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

arXiv:2606.18963v1 Announce Type: new Abstract: We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact,…

21
arXiv — Machine Learning research 14d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

arXiv:2606.18967v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive…

25
arXiv — NLP / Computation & Language research 14d ago

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

arXiv:2606.18406v1 Announce Type: new Abstract: Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces…

11
arXiv — NLP / Computation & Language research 14d ago

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the…

19
arXiv — NLP / Computation & Language research 14d ago

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators…

37
arXiv — NLP / Computation & Language research 14d ago

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv:2606.18831v1 Announce Type: new Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a…

36
arXiv — NLP / Computation & Language research 14d ago

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

arXiv:2606.19308v1 Announce Type: new Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm…

13
arXiv — NLP / Computation & Language research 14d ago

Learning User Simulators with Turing Rewards

arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by…

37
arXiv — NLP / Computation & Language research 14d ago

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors…

8
arXiv — NLP / Computation & Language research 14d ago

CEO-Bench: Can Agents Play the Long Game?

arXiv:2606.18543v1 Announce Type: cross Abstract: Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain…

29
arXiv — NLP / Computation & Language research 14d ago

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

arXiv:2606.18947v1 Announce Type: cross Abstract: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider…

20
arXiv — NLP / Computation & Language research 14d ago

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

arXiv:2603.00026v2 Announce Type: replace Abstract: Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may…

15
Hugging Face Daily Papers research 14d ago

CEO-Bench: Can Agents Play the Long Game?

Abstract CEO-Bench evaluates language model agents' ability to manage a simulated startup over 500 days, testing their proficiency in long-term planning, noise handling, adaptability, and multi-task coordination through a Python interface. Generated by…

5
Hugging Face Daily Papers research 14d ago

Guava: An Effective and Universal Harness for Embodied Manipulation

Abstract A harness framework for embodied tool use combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Language models trained on large-scale…

15
Hugging Face official-blog 14d ago

Is it agentic enough? Benchmarking open models on your own tooling

Published June 18, 2026. Lysandre, Nathan Habib, and Pedro Cuenca. Benchmarking transformers revisions across different metrics. This is a…

26
