Agents + tool use
500 articles archived under #agents

arXiv — NLP / Computation & Language research 13d ago Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning arXiv:2606.20002v1 Announce Type: cross Abstract: This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it… 13

arXiv — NLP / Computation & Language research 13d ago SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation arXiv:2606.19659v1 Announce Type: new Abstract: On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on… 17

arXiv — NLP / Computation & Language research 13d ago AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts arXiv:2606.19847v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented… 32

arXiv — NLP / Computation & Language research 13d ago Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives arXiv:2606.19852v1 Announce Type: new Abstract: Information extraction from pathology reports is essential for cancer staging, tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional… 26

arXiv — NLP / Computation & Language research 13d ago When Does Streaming Tool Use Help?
Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation arXiv:2606.20113v1 Announce Type: new Abstract: Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the… 21

arXiv — NLP / Computation & Language research 13d ago Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems arXiv:2606.20487v1 Announce Type: new Abstract: Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition… 16

arXiv — NLP / Computation & Language research 13d ago Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen? arXiv:2606.19388v1 Announce Type: cross Abstract: Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct… 31

arXiv — NLP / Computation & Language research 13d ago DeXposure-Claw: An Agentic System for DeFi Risk Supervision arXiv:2606.19501v1 Announce Type: cross Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks.
General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing… 14

arXiv — NLP / Computation & Language research 13d ago Uncertainty Decomposition for Clarification Seeking in LLM Agents arXiv:2606.19559v1 Announce Type: cross Abstract: Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable… 9

arXiv — NLP / Computation & Language research 13d ago Benchmarking Agentic Review Systems arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems… 15

arXiv — NLP / Computation & Language research 13d ago Multi-Agent Transactive Memory arXiv:2606.19911v1 Announce Type: cross Abstract: The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated… 24

arXiv — NLP / Computation & Language research 13d ago When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant.
However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving… 17

arXiv — NLP / Computation & Language research 13d ago LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents arXiv:2606.20529v1 Announce Type: cross Abstract: Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and… 27

arXiv — NLP / Computation & Language research 13d ago ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and… 22

Hugging Face Daily Papers research 13d ago Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27

r/LocalLLaMA community 13d ago GLM-5.2 is above GPT-5.5 in AA-Briefcase, Artificial Analysis' new agentic knowledge work eval submitted by /u/analysis_scaled 7

Hacker News — AI on Front Page community 13d ago Zero-Touch OAuth for MCP Article URL: https://blog.modelcontextprotocol.io/posts/enterprise-managed-auth/ Comments URL: https://news.ycombinator.com/item?id=48592163 Points: 202 # Comments: 66 17

Hugging Face official-blog 14d ago MosaicLeaks: Can your research agent keep a secret?
Enterprise Article Published June 18, 2026 Alexander Gurung agurung ServiceNow Rafael Pardinas rafapi-snow ServiceNow TL;DR Deep research agents increasingly combine private local documents… 24

r/LocalLLaMA community 14d ago poolside/Laguna-M.1 · Hugging Face - 225B-A23B Laguna M.1 Laguna M.1 is a 225B total parameter Mixture-of-Experts model with 23B activated parameters per token designed for agentic coding and long-horizon work. Highlights Large sparse MoE for agentic coding : Laguna M.1 is a 70-layer MoE transformer with 225B total… 26

TechCrunch — AI news-outlet 14d ago General Intuition in talks to raise $300M at around $2B valuation General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning. 14

Hugging Face Daily Papers research 14d ago iOSWorld: A Benchmark for Personally Intelligent Phone Agents Abstract IOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be… 6

Hugging Face Daily Papers research 14d ago MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggles with multi-application tasks and… 29

r/LocalLLaMA community 14d ago gave my local llm agent mcp tools for local image + video gen, so it just generates when i ask (fully offline+free) free and open source, runs fully offline. the local llm agent does the image and video gen itself via mcp tools. details and github in the comments.
submitted by /u/GroundbreakingMall54 33

Hugging Face Daily Papers research 14d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems Abstract Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multicultural multi-agent systems… 28

r/LocalLLaMA community 14d ago Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF · Hugging Face Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. Highlights Outstanding Video Understanding and… 29

Hugging Face Daily Papers research 14d ago RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents Abstract RODS addresses sample depletion in multi-turn tool-use reinforcement learning by dynamically synthesizing new data based on reward variance to maintain informative training samples. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-turn tool-use RL is bottlenecked by… 21

r/LocalLLaMA community 14d ago I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it? Yes I know this is a simple question I could just ask Claude or something but I want to see what the community suggests For context it’s a 16in MacBook Pro and i use Hermes agent as a harness connected to LM studio as obviously it’s preferable to be running MLX models especially… 4

Vercel — AI dev-tools 14d ago The Agent Stack Agents are designed to do almost any kind of work, from answering support tickets to writing code.
No matter how complex the workload, how long it runs, or how many turns it takes to complete, every agent needs three core capabilities to operate: Agents need to connect to models… 16

Hugging Face Daily Papers research 14d ago Native Active Perception as Reasoning for Omni-Modal Understanding Abstract OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing. Generated by… 24

arXiv — NLP / Computation & Language research 14d ago Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier arXiv:2606.18284v1 Announce Type: cross Abstract: The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve,… 21

arXiv — NLP / Computation & Language research 14d ago LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents arXiv:2606.18388v1 Announce Type: cross Abstract: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to… 34

arXiv — Machine Learning research 14d ago Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment.
However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,… 18

arXiv — Machine Learning research 14d ago Stealthy World Model Manipulation via Data Poisoning arXiv:2606.18697v1 Announce Type: new Abstract: Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack… 18

arXiv — Machine Learning research 14d ago Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets arXiv:2606.18820v1 Announce Type: new Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational… 19

arXiv — NLP / Computation & Language research 14d ago GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory… 22

arXiv — Machine Learning research 14d ago Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards arXiv:2606.18963v1 Announce Type: new Abstract: We study online reward-punishment learning when the environment provides no scalar reward or evaluative label.
At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact,… 21

arXiv — Machine Learning research 14d ago EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts arXiv:2606.18967v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive… 25

arXiv — NLP / Computation & Language research 14d ago CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents arXiv:2606.18406v1 Announce Type: new Abstract: Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces… 11

arXiv — NLP / Computation & Language research 14d ago VISUALSKILL: Multimodal Skills for Computer-Use Agents arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the… 19

arXiv — NLP / Computation & Language research 14d ago LegalWorld: A Life-Cycle Interactive Environment for Legal Agents arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later.
Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators… 37

arXiv — NLP / Computation & Language research 14d ago Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning arXiv:2606.18831v1 Announce Type: new Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a… 36

arXiv — NLP / Computation & Language research 14d ago Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play arXiv:2606.19308v1 Announce Type: new Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm… 13

arXiv — NLP / Computation & Language research 14d ago Learning User Simulators with Turing Rewards arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by… 37

arXiv — NLP / Computation & Language research 14d ago Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors… 8

arXiv — NLP / Computation & Language research 14d ago CEO-Bench: Can Agents Play the Long Game?
arXiv:2606.18543v1 Announce Type: cross Abstract: Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain… 29

arXiv — NLP / Computation & Language research 14d ago Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents arXiv:2606.18947v1 Announce Type: cross Abstract: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider… 20

arXiv — NLP / Computation & Language research 14d ago ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents arXiv:2603.00026v2 Announce Type: replace Abstract: Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may… 15

Hugging Face Daily Papers research 14d ago CEO-Bench: Can Agents Play the Long Game? Abstract CEO-Bench evaluates language model agents' ability to manage a simulated startup over 500 days, testing their proficiency in long-term planning, noise handling, adaptability, and multi-task coordination through a Python interface. Generated by… 5

Hugging Face Daily Papers research 14d ago Guava: An Effective and Universal Harness for Embodied Manipulation Abstract A harness framework for embodied tool use combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Language models trained on large-scale… 15

Hugging Face official-blog 14d ago Is it agentic enough?
Benchmarking open models on your own tooling Published June 18, 2026 Lysandre lysandre Nathan Habib SaylorTwift Pedro Cuenca pcuenq Benchmarking transformers revisions across different metrics This is a… 26