Tag: Agents + tool use
500 articles archived under #agents
arXiv — NLP / Computation & Language research 1d ago Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale arXiv:2606.30801v1 Announce Type: new Abstract: Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box access to the algorithms, while personalization depends on… 37
arXiv — NLP / Computation & Language research 1d ago A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases arXiv:2606.31041v1 Announce Type: new Abstract: Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names,… 12
arXiv — NLP / Computation & Language research 1d ago Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with… 7
arXiv — NLP / Computation & Language research 1d ago Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas? arXiv:2606.31213v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed as moral advisors and agents, they need to address dilemmas between two competing values.
However, existing research on LLMs with moral dilemmas overlooks a central aspect… 11
arXiv — NLP / Computation & Language research 1d ago When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue arXiv:2606.31307v1 Announce Type: new Abstract: Large language models used in task-oriented dialogue often produce fluent but unsafe responses when backend database calls fail, return empty results, or surface mismatched information, inventing venues, confirmations, or booking… 16
arXiv — NLP / Computation & Language research 1d ago FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents arXiv:2606.31522v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision… 19
arXiv — NLP / Computation & Language research 1d ago AutoTrainess: Teaching Language Models to Improve Language Models Autonomously arXiv:2606.31551v1 Announce Type: new Abstract: Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that… 37
arXiv — NLP / Computation & Language research 1d ago Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations.
In this paper… 25
arXiv — NLP / Computation & Language research 1d ago DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching arXiv:2606.31980v1 Announce Type: new Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions… 36
arXiv — NLP / Computation & Language research 1d ago Generative Skill Composition for LLM Agents arXiv:2606.32025v1 Announce Type: new Abstract: Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a… 34
arXiv — NLP / Computation & Language research 1d ago Emergent Culture in Minimal LLM Systems arXiv:2606.30668v1 Announce Type: cross Abstract: What happens when LLM agents operate with no context outside a turn, minimal prompting, and simple tools? Inspired by swarm engineering, we give collectives of three agents the ability to send messages and manipulate a shared… 22
arXiv — NLP / Computation & Language research 1d ago HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications.
We introduce HealthAgentBench, a suite… 29
arXiv — NLP / Computation & Language research 1d ago Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents arXiv:2606.31270v1 Announce Type: cross Abstract: Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these… 20
arXiv — NLP / Computation & Language research 1d ago The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills arXiv:2606.31272v1 Announce Type: cross Abstract: AI agents increasingly acquire and execute skills at runtime: bundles of prompt instructions, executable code, and tool declarations fetched from marketplaces and other agents. Governing them needs a stable notion of skill… 16
arXiv — NLP / Computation & Language research 1d ago ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping arXiv:2606.31693v1 Announce Type: cross Abstract: The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation… 38
arXiv — NLP / Computation & Language research 1d ago MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments arXiv:2606.31966v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored.
To address this gap, we introduce MECoBench, a… 4
arXiv — NLP / Computation & Language research 1d ago InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training arXiv:2601.04126v3 Announce Type: replace Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present… 29
Hugging Face Daily Papers research 1d ago Dockerless: Environment-Free Program Verifier for Coding Agents Abstract A Dockerless environment-free agentic patch verifier improves code patch evaluation accuracy and enables effective post-training without execution-based verification costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Program verifiers play a central role in training… 21
Hugging Face Daily Papers research 1d ago LUMOS: A Semantic Operating-System Layer for Accessibility-Grounded AI Agents Abstract LUMOS provides a semantic interaction layer that converts operating system metadata into machine-readable formats, enabling AI agents to interact more efficiently with computer interfaces than through traditional visual methods. Generated by… 34
Hugging Face Daily Papers research 1d ago OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks Abstract OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing computer-use benchmarks… 24
r/MachineLearning community 1d ago REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R] submitted by /u/julian88888888 13
Vercel — AI dev-tools 1d ago Vercel Security Dashboard is in private beta The Vercel Security Dashboard is now in private beta.
It aggregates the security posture of every account and project on Vercel. As teams grow and coding agents make it easy to spin up new projects, small misconfigurations can add up quietly and quickly. Team members without… 26
Vercel — AI dev-tools 1d ago Enforce consistent code for agents and humans with konsistent konsistent is now open source. konsistent is a CLI linter for TypeScript codebases that enforces structural conventions, giving agents and humans the consistent context they need to implement features correctly. Deterministic, fast, and covers structural patterns that TypeScript… 6
r/LocalLLaMA community 1d ago Vibe Coding / Agentic workflow Hey folks. I know that vibe coding is frowned upon pretty solidly here, and I get that, but I’m not a programmer. I just don’t realistically have the time to learn python or C++ to the level I would need to to build some of the things I’d like to create. On a side note, I do… 28
TechCrunch — AI news-outlet 1d ago OpenClaw is finally available on Android and iOS The free open source agentic program is finally invading your phone. 21
r/LocalLLaMA community 1d ago Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance) Repo → huggingface.co/LordNeel/Agents-A1-GGUF I made GGUF quants of InternScience/Agents-A1 — a 35B Mixture of Experts agent model (Qwen3.5-MoE, ~3B active, 256 experts / 8+1 active, hybrid linear+full attention, 256K context).
It's built for long-horizon search, tool-calling,… 27
Anthropic SDK (Python) releases dev-tools 1d ago v0.115.0 0.115.0 (2026-06-30) Full Changelog: v0.114.0...v0.115.0 Features api: add support for Managed Agents event delta streaming, agent overrides, reverse pagination, vault credential injection scoping, and agent and deployment webhook events (8c23f7e) 24
Hugging Face official-blog 1d ago ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration Enterprise Article Published June 30, 2026 Raju Pavuluri rpavuluri ibm-research Rahul Krishna rkrsn ibm-research Srikanth Govindaraj Tamilselvam stamilse ibm-research… 13
r/LocalLLaMA community 1d ago The harness matters more than the model. A 27B behind good critics changed my mind. I saw someone test Qwen3.6-27B with a 3-critic harness. The harness included code review, test review and Playwright e2e. Each critic had context. The result was that the model is usable for coding work. This matches what I have come to believe from running agents in production.… 20
TechCrunch — AI news-outlet 1d ago Anthropic launches Claude Sonnet 5 as a cheaper way to run agents Anthropic’s Claude Sonnet 5 brings stronger agentic capabilities, lower pricing, and improved safety, positioning the model as a cheaper alternative to Opus, GPT-5.5, and Gemini Pro. 5
TechCrunch — AI news-outlet 1d ago Acti puts AI agents directly into your smartphone keyboard Startup Acti is betting the smartphone keyboard is the next home for AI assistants. Its new keyboard for iOS and Android works across apps and lets users create custom AI-powered shortcuts using natural language.
24
Simon Willison community 1d ago Have your agent record video demos of its work with shot-scraper video shot-scraper video is a new command introduced in today's shot-scraper 1.10 release which accepts a storyboard.yml file defining a routine to run against a web application and uses Playwright to record a video of that routine. I've written before about the importance of having… 31
Hugging Face Daily Papers research 1d ago SWE-Together: Evaluating Coding Agents in Interactive User Sessions Abstract SWE-Together is a multi-turn coding benchmark created from real user-agent interactions, featuring a reactive LLM simulator to evaluate agents based on both final correctness and interaction efficiency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Most coding-agent… 32
Hugging Face Daily Papers research 1d ago RocketSmith: Agentic Additive Manufacturing of High-Powered Rockets Abstract An agentic system using large language models automates high-power rocket design processes, enabling successful flight testing with consistent simulation results. Generated by Qwen/Qwen2.5-Coder-32B-Instruct RocketSmith is an agentic system which intelligently automates… 9
MIT News — AI research 1d ago Q&A: What is agentic AI today, and what do we want it to be? Computer scientist Phillip Isola cuts through the hype to explain how AI agents work and what the future might hold for this rapidly advancing technology. 19
Simon Willison community 1d ago shot-scraper 1.10 Release: shot-scraper 1.10 The big new feature is shot-scraper video storyboard.yml, described in detail in Have your agent record video demos of its work with shot-scraper video. Tags: shot-scraper 11
TechCrunch — AI news-outlet 1d ago X now offers an MCP server to make its platform easier for AI tools to use X has launched a hosted MCP server, making it easier for developers to connect AI applications with the company’s API.
24
r/LocalLLaMA community 1d ago I built an autonomous dev pipeline and ran the same project head to head: a 27B local on a modded 4090, then again on cheap cloud LLMs Hey everyone! I open-sourced something I've been working on called Lullabeast. It's an autonomous dev pipeline. You describe your project and planner, executor, and reviewer agents build it phase by phase against a real git repo. How it came to be: for the last year or so I've… 9
TechCrunch — AI news-outlet 1d ago Amazon launches new $1 billion FDE org, following OpenAI and Anthropic Engineers on the new team will embed within companies to deploy purpose-built agents, focusing on fast deployments and customer self-sufficiency. 28
r/LocalLLaMA community 1d ago I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one.. So I finally benchmarked it properly.. Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run with the full… 23
Hugging Face Daily Papers research 1d ago Agentic Abstention: Do Agents Know When to Stop Instead of Act? Abstract Agentic abstention involves determining when an AI agent should cease interaction under uncertainty, requiring sequential decision-making across multiple environments and task types. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents are expected to act over… 13
TechCrunch — AI news-outlet 1d ago Crypto exchange OKX wants AI agents to hire and pay each other OKX is bringing together payments, identity and reputation into a marketplace for AI agents.
29
Hugging Face Daily Papers research 2d ago TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents Abstract TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents. Generated by… 4
Smol AI News news-outlet 2d ago not much happened today **Anthropic** launched **Claude Sonnet 5** as its new default mid-tier frontier model, featuring a **1M-token context window**, enhanced agentic capabilities including planning, browser and terminal tool use, and autonomous execution previously requiring larger models. The model… 27
arXiv — Machine Learning research 2d ago On the Necessity of a Liquid Substrate for Mesh Intelligence arXiv:2606.28413v1 Announce Type: new Abstract: A mesh of sovereign agents has no center: no shared clock, no shared model, and no coordinator to gather data or retrain. Its competence rests on each agent folding the projections its peers emit into a single internal state,… 8
arXiv — Machine Learning research 2d ago An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations arXiv:2606.28467v1 Announce Type: new Abstract: Appliance-level energy monitoring in office buildings produces noisy alerts that non-expert facility managers struggle to use. This paper proposes an end-to-end agentic pipeline that combines deep time-series forecasting,… 11
arXiv — Machine Learning research 2d ago Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization arXiv:2606.28764v1 Announce Type: new Abstract: Hierarchical decision-making frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals.
Despite their promise, existing hierarchical policies face critical… 20
arXiv — Machine Learning research 2d ago The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification… 38
arXiv — Machine Learning research 2d ago Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived… 16
arXiv — Machine Learning research 2d ago Modification-Considering Value Learning for Reward Hacking Mitigation in RL arXiv:2606.28955v1 Announce Type: new Abstract: Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain… 10