Agents + tool use (500 articles archived under #agents)
arXiv — Machine Learning research 2d ago A Linear Matching Bandit Approach to Online Multi-Human Multi-Robot Teaming arXiv:2606.29221v1 Announce Type: new Abstract: We address the problem of online multi-human multi-robot teaming through the lens of a linear matching bandit framework, where a learner assigns robots with unknown features from a fixed pool to distinct sets of human agents over… 15 arXiv — Machine Learning research 2d ago Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning arXiv:2606.29280v1 Announce Type: new Abstract: We identify intervention bias as a previously unquantified failure mode of zero-shot large-language-model (LLM) educational advisory agents: without task-specific training, they recommend action when a hindsight-optimal oracle… 31 arXiv — Machine Learning research 2d ago Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents arXiv:2606.29459v1 Announce Type: new Abstract: Inverse design of metal-organic frameworks (MOFs) requires searching a combinatorially vast space where property labels are expensive and most machine-learning models reveal little about why a structure succeeds. We introduce… 8 arXiv — Machine Learning research 2d ago CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning arXiv:2606.29476v1 Announce Type: new Abstract: Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. 
The prevailing recipe gates this loss by a… 24 arXiv — NLP / Computation & Language research 2d ago Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models arXiv:2606.28524v1 Announce Type: new Abstract: Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measured by the false belief task (FBT), yet persistent concerns of construct validity remain. We adopt a… 25 arXiv — NLP / Computation & Language research 2d ago SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce… 28 arXiv — NLP / Computation & Language research 2d ago MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling arXiv:2606.29265v1 Announce Type: new Abstract: Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. 
However, existing LLM-based counseling agents,… 17 arXiv — NLP / Computation & Language research 2d ago Hybrid Retriever Evolution for Multimodal Document Reasoning Agents arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to… 33 arXiv — NLP / Computation & Language research 2d ago SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution arXiv:2606.29713v1 Announce Type: new Abstract: Hallucination is the reliability bottleneck for LLM-based agents, and fact attribution verifiers are the last line of defense -- yet today's verifiers emit only opaque binary labels, leaving agents unable to self-correct and… 24 arXiv — NLP / Computation & Language research 2d ago Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering arXiv:2606.29824v1 Announce Type: new Abstract: While Large Language Models (LLMs) excel as static solvers, transforming them into autonomous agents remains challenging. 
This transition requires continuous environmental interaction, yet current agents lack the necessary… 17 arXiv — NLP / Computation & Language research 2d ago KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search arXiv:2606.29863v1 Announce Type: new Abstract: Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration -- deciding when to trust parametric… 38 arXiv — NLP / Computation & Language research 2d ago MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it… 4 arXiv — NLP / Computation & Language research 2d ago Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios? arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is… 17 arXiv — NLP / Computation & Language research 2d ago LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard arXiv:2606.30005v1 Announce Type: new Abstract: Long-horizon tool agents are bottlenecked by how their context grows toward the limits of the context window. 
Recent systems make context management agent- or system-controlled, but they either learn a compression policy that… 34 Hugging Face Daily Papers research 2d ago TACO: Tool-Augmented Credit Optimization for Agentic Tool Use Abstract Tool-Augmented Credit Optimization (TACO) improves multimodal agent performance by distinguishing useful, redundant, or misleading code operations through dual advantage channels: Differential Answer-Probe Reward for individual tool contribution and Outcome-Gated… 38 Hugging Face Daily Papers research 2d ago Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent Abstract Agents-A1, a 35B Mixture-of-Experts Agentic Model, achieves trillion-parameter-level performance through long-horizon trajectory scaling and heterogeneous agent ability scaling via a three-stage training approach involving supervised fine-tuning, domain-level teacher… 28 Hugging Face Daily Papers research 2d ago PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents Abstract POLICYGUARD is a sub-agent verifier that enhances LLM agent policy adherence by providing contextual reasoning and conversation-specific feedback across multi-turn interactions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents handle user requests on behalf of… 11 Hugging Face Daily Papers research 2d ago GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots Abstract GUICrafter addresses GUI agent data challenges through a weakly-supervised approach using unannotated screenshots and a two-stage curriculum learning framework for visual grounding and reinforcement learning calibration. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 10 Hugging Face Daily Papers research 2d ago Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction Abstract A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks. Generated by… 28 Vercel — AI dev-tools 2d ago Claude Sonnet 5 now available on Vercel AI Gateway Claude Sonnet 5 from Anthropic is now available on AI Gateway . Sonnet 5 improves on Sonnet 4.6 across coding and agentic work, reaching outcomes on many tasks that previously needed an Opus model, at Sonnet pricing. The model is more agentic and follows instructions more… 14 Vercel — AI dev-tools 2d ago Vercel Private Blob is now generally available Vercel Private Blob is now generally available for all plans. Store sensitive files like user-uploaded photos, invoices, and agent memory, and control exactly who can read them. Private stores, Signed URLs, and OIDC authentication all graduate from beta with this release. Vercel… 22 Vercel — AI dev-tools 2d ago An expanded Vercel Agent: chat, investigations, and approved actions, now in public beta Today, we're launching expanded capabilities for Vercel Agent in public beta. Vercel Agent now lives in your dashboard and can investigate production issues, answer questions about your projects, and take action on your behalf. Because Agent runs inside the platform that deploys… 27 Vercel — AI dev-tools 2d ago Vercel Agent has updated pricing Vercel Agent pricing is changing. You no longer need to pre-load credits or manage a separate wallet for Vercel Agent. Instead of a $0.30 per-request fee, you now pay a Vercel Token Rate of $0.25 per million tokens, plus provider inference costs at the underlying token rate. 
The… 6 MIT Technology Review — AI news-outlet 2d ago AI agents are not your “coworkers” This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. Imagine coming in to work to learn that a new underling will report to you. The worker is not a person but an AI tool—one that your company… 6 Hacker News — AI on Front Page community 2d ago Ornith-1.0: self-improving open-source models for agentic coding Article URL: https://github.com/deepreinforce-ai/Ornith-1 Comments URL: https://news.ycombinator.com/item?id=48722052 Points: 215 # Comments: 39 31 TechCrunch — AI news-outlet 2d ago Cursor now has a mobile app for guiding your coding agent on the go Cursor has launched a new mobile app for remote oversight over coding agents. 29 Simon Willison community 2d ago Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding This is an interesting new open weights (MIT licensed) model, the first model release from DeepReinforce. [...] with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen… 5 NVIDIA Developer Blog official-blog 2d ago How to Govern Autonomous Agents in Enterprise AI Factories AI agents are quickly moving beyond chat. They inspect code, run tests, read documents, search knowledge bases, query internal systems, and operate for hours on... 28 MIT Technology Review — AI news-outlet 2d ago Agent confidence on the technical frontier Enterprise investment in AI is booming. Gartner is calling 2026 an “inflection year” for organizations to align their AI projects with strategic business objectives. As the pressure to prove ROI mounts, executives and technology leaders are looking to agentic AI to drive the… 19 r/LocalLLaMA community 2d ago Anyone else end up building a web access layer for local AI agents? 
I've been running local models for most of my experiments, and I kept running into the same issue. The model lives locally, but everything it needs to interact with doesn't. Every new agent ended up with another GitHub client, another Reddit integration, another documentation… 10 r/MachineLearning community 3d ago Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R] Google deployed an agentic AI peer-reviewer at two top CS conferences — reviewing ~10,000 papers with 30-minute turnaround — and the new formal research paper shows it catches 34% more mathematical errors than zero-shot prompting; the precedent for AI-automated scientific review… 23 Vercel — AI dev-tools 3d ago Build realtime voice agents on AI Gateway AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI . Each… 26 Smol AI News news-outlet 3d ago not much happened today **Meta** announced **Brain2Qwerty v2**, a real-time non-invasive brain-to-text decoder achieving up to **78% word accuracy** with released training code and dataset. **Cursor** launched **Cursor for iOS** with remote AI agents and live activity features. Open-weight model access… 35 arXiv — Machine Learning research 3d ago Training Observable Control Policies to Expose Agent State Through Actions arXiv:2606.27609v1 Announce Type: new Abstract: Physical or operational constraints often impose communications limitations on autonomous agents. Such limitations complicate monitoring or multiagent coordination. 
Even when strong communications are absent, some information may… 13 arXiv — Machine Learning research 3d ago COOPA: A Modular LLM Agent Architecture for Operations Research Problems arXiv:2606.27611v1 Announce Type: new Abstract: Operations Research (OR) provides a rigorous framework for high-stakes decision-making, but effective OR modeling requires substantial domain knowledge, mathematical abstraction, and solver expertise. Recent LLM-based systems… 18 arXiv — Machine Learning research 3d ago CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association arXiv:2606.28179v1 Announce Type: new Abstract: Identifying robust associations between cardiac imaging phenotypes and clinical diseases is fundamental to population-scale cardiovascular research and reliable risk stratification. However, current phenome-wide association studies… 13 arXiv — Machine Learning research 3d ago LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior arXiv:2606.28182v1 Announce Type: new Abstract: Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)-based agents often exhibit behaviors that are… 28 arXiv — Machine Learning research 3d ago Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives arXiv:2606.28217v1 Announce Type: new Abstract: We propose a framework for reward allocation in fully delegated AI cooperatives where humans are represented by agents that contribute data and participate in model updates under heterogeneous value constraints. 
The key idea is to… 32 arXiv — NLP / Computation & Language research 3d ago Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement arXiv:2606.27409v1 Announce Type: cross Abstract: Multi-agent large language model (LLM) systems often rely on verifier and critic agents to suppress hallucinations, but verification is delayed. During this delay, false claims can propagate through the agent network. We model… 25 arXiv — NLP / Computation & Language research 3d ago Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents arXiv:2606.27472v1 Announce Type: new Abstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding… 16 arXiv — NLP / Computation & Language research 3d ago Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents arXiv:2606.27595v1 Announce Type: new Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,… 32 arXiv — NLP / Computation & Language research 3d ago When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. 
However, existing benchmarks often assume… 27 arXiv — NLP / Computation & Language research 3d ago DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write… 11 arXiv — NLP / Computation & Language research 3d ago AI Persuasive Framing in Collective Dilemmas arXiv:2606.27951v1 Announce Type: cross Abstract: AI agents are promising tools that can act as flexible behavioral nudges to enhance human cooperation in addressing large-scale societal problems. However, evidence on whether AI agents can effectively boost cooperation remains… 32 arXiv — NLP / Computation & Language research 3d ago Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn… 20 arXiv — NLP / Computation & Language research 3d ago Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification arXiv:2511.03217v2 Announce Type: replace Abstract: Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. 
At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet… 4 arXiv — NLP / Computation & Language research 3d ago LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks arXiv:2604.13072v2 Announce Type: replace Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be… 28 arXiv — NLP / Computation & Language research 3d ago EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from… 34 arXiv — NLP / Computation & Language research 3d ago Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using… 4 r/LocalLLaMA community 3d ago I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers. This is something I've been working on. I like playing around with smaller local models but found most agent harnesses not well suited for them. The failure modes across different model families tend to be the same: Failed tool calls Poor verification of environment variables Poor… 12