Tag

Agents + tool use

500 articles archived under #agents

arXiv — NLP / Computation & Language research 1d ago

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

arXiv:2606.30801v1 Announce Type: new Abstract: Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box access to the algorithms, while personalization depends on…

37
arXiv — NLP / Computation & Language research 1d ago

A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases

arXiv:2606.31041v1 Announce Type: new Abstract: Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names,…

12
arXiv — NLP / Computation & Language research 1d ago

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with…

7
arXiv — NLP / Computation & Language research 1d ago

Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas?

arXiv:2606.31213v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed as moral advisors and agents, they need to address dilemmas between two competing values. However, existing research on LLMs with moral dilemmas overlooks a central aspect…

11
arXiv — NLP / Computation & Language research 1d ago

When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue

arXiv:2606.31307v1 Announce Type: new Abstract: Large language models used in task-oriented dialogue often produce fluent but unsafe responses when backend database calls fail, return empty results, or surface mismatched information, inventing venues, confirmations, or booking…

16
arXiv — NLP / Computation & Language research 1d ago

FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

arXiv:2606.31522v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision…

19
arXiv — NLP / Computation & Language research 1d ago

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

arXiv:2606.31551v1 Announce Type: new Abstract: Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that…

37
arXiv — NLP / Computation & Language research 1d ago

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper…

25
arXiv — NLP / Computation & Language research 1d ago

DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

arXiv:2606.31980v1 Announce Type: new Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions…

36
arXiv — NLP / Computation & Language research 1d ago

Generative Skill Composition for LLM Agents

arXiv:2606.32025v1 Announce Type: new Abstract: Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a…

34
arXiv — NLP / Computation & Language research 1d ago

Emergent Culture in Minimal LLM Systems

arXiv:2606.30668v1 Announce Type: cross Abstract: What happens when LLM agents operate with no context outside a turn, minimal prompting, and simple tools? Inspired by swarm engineering, we give collectives of three agents the ability to send messages and manipulate a shared…

22
arXiv — NLP / Computation & Language research 1d ago

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite…

29
arXiv — NLP / Computation & Language research 1d ago

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

arXiv:2606.31270v1 Announce Type: cross Abstract: Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these…

20
arXiv — NLP / Computation & Language research 1d ago

The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills

arXiv:2606.31272v1 Announce Type: cross Abstract: AI agents increasingly acquire and execute skills at runtime: bundles of prompt instructions, executable code, and tool declarations fetched from marketplaces and other agents. Governing them needs a stable notion of skill…

16
arXiv — NLP / Computation & Language research 1d ago

ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

arXiv:2606.31693v1 Announce Type: cross Abstract: The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation…

38
arXiv — NLP / Computation & Language research 1d ago

MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

arXiv:2606.31966v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a…

4
arXiv — NLP / Computation & Language research 1d ago

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

arXiv:2601.04126v3 Announce Type: replace Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present…

29
Hugging Face Daily Papers research 1d ago

Dockerless: Environment-Free Program Verifier for Coding Agents

Abstract: A Dockerless environment-free agentic patch verifier improves code patch evaluation accuracy and enables effective post-training without execution-based verification costs. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Program verifiers play a central role in training…

21
Hugging Face Daily Papers research 1d ago

LUMOS: A Semantic Operating-System Layer for Accessibility-Grounded AI Agents

Abstract: LUMOS provides a semantic interaction layer that converts operating system metadata into machine-readable formats, enabling AI agents to interact more efficiently with computer interfaces than through traditional visual methods. Generated by…

34
Hugging Face Daily Papers research 1d ago

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Abstract: OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Existing computer-use benchmarks…

24
r/MachineLearning community 1d ago

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]

submitted by /u/julian88888888

13
Vercel — AI dev-tools 1d ago

Vercel Security Dashboard is in private beta

The Vercel Security Dashboard is now in private beta. It aggregates the security posture of every account and project on Vercel. As teams grow and coding agents make it easy to spin up new projects, small misconfigurations can add up quietly and quickly. Team members without…

26
Vercel — AI dev-tools 1d ago

Enforce consistent code for agents and humans with konsistent

konsistent is now open source. konsistent is a CLI linter for TypeScript codebases that enforces structural conventions, giving agents and humans the consistent context they need to implement features correctly. Deterministic, fast, and covers structural patterns that TypeScript…

6
r/LocalLLaMA community 1d ago

Vibe Coding / Agentic workflow

Hey folks. I know that vibe coding is frowned upon pretty solidly here, and I get that, but I’m not a programmer. I just don’t realistically have the time to learn python or C++ to the level I would need to to build some of the things I’d like to create. On a side note, I do…

28
TechCrunch — AI news-outlet 1d ago

OpenClaw is finally available on Android and iOS

The free open source agentic program is finally invading your phone.

21
r/LocalLLaMA community 1d ago

Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)

Repo → huggingface.co/LordNeel/Agents-A1-GGUF I made GGUF quants of InternScience/Agents-A1 — a 35B Mixture of Experts agent model (Qwen3.5-MoE, ~3B active, 256 experts / 8+1 active, hybrid linear+full attention, 256K context). It's built for long-horizon search, tool-calling,…

27
Anthropic SDK (Python) releases dev-tools 1d ago

v0.115.0

0.115.0 (2026-06-30). Full Changelog: v0.114.0...v0.115.0. Features: api: add support for Managed Agents event delta streaming, agent overrides, reverse pagination, vault credential injection scoping, and agent and deployment webhook events (8c23f7e)

24
Hugging Face official-blog 1d ago

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

Enterprise article, published June 30, 2026. Raju Pavuluri (rpavuluri, ibm-research), Rahul Krishna (rkrsn, ibm-research), Srikanth Govindaraj Tamilselvam (stamilse, ibm-research)…

13
r/LocalLLaMA community 1d ago

The harness matters more than the model. A 27B behind good critics changed my mind.

I saw someone test Qwen3.6-27B with a 3-critic harness. The harness included code review, test review and Playwright e2e. Each critic had context. The result was that the model is usable for coding work. This matches what I have come to believe from running agents in production.…

20
TechCrunch — AI news-outlet 1d ago

Anthropic launches Claude Sonnet 5 as a cheaper way to run agents

Anthropic’s Claude Sonnet 5 brings stronger agentic capabilities, lower pricing, and improved safety, positioning the model as a cheaper alternative to Opus, GPT-5.5, and Gemini Pro.

5
TechCrunch — AI news-outlet 1d ago

Acti puts AI agents directly into your smartphone keyboard

Startup Acti is betting the smartphone keyboard is the next home for AI assistants. Its new keyboard for iOS and Android works across apps and lets users create custom AI-powered shortcuts using natural language.

24
Simon Willison community 1d ago

Have your agent record video demos of its work with shot-scraper video

shot-scraper video is a new command introduced in today's shot-scraper 1.10 release which accepts a storyboard.yml file defining a routine to run against a web application and uses Playwright to record a video of that routine. I've written before about the importance of having…

31
Hugging Face Daily Papers research 1d ago

SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Abstract: SWE-Together is a multi-turn coding benchmark created from real user-agent interactions, featuring a reactive LLM simulator to evaluate agents based on both final correctness and interaction efficiency. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Most coding-agent…

32
Hugging Face Daily Papers research 1d ago

RocketSmith: Agentic Additive Manufacturing of High-Powered Rockets

Abstract: An agentic system using large language models automates high-power rocket design processes, enabling successful flight testing with consistent simulation results. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) RocketSmith is an agentic system which intelligently automates…

9
MIT News — AI research 1d ago

Q&A: What is agentic AI today, and what do we want it to be?

Computer scientist Phillip Isola cuts through the hype to explain how AI agents work and what the future might hold for this rapidly advancing technology.

19
Simon Willison community 1d ago

shot-scraper 1.10

Release: shot-scraper 1.10. The big new feature is shot-scraper video storyboard.yml, described in detail in "Have your agent record video demos of its work with shot-scraper video". Tags: shot-scraper

11
TechCrunch — AI news-outlet 1d ago

X now offers an MCP server to make its platform easier for AI tools to use

X has launched a hosted MCP server, making it easier for developers to connect AI applications with the company’s API.

24
r/LocalLLaMA community 1d ago

I built an autonomous dev pipeline and ran the same project head to head: a 27B local on a modded 4090, then again on cheap cloud LLMs

Hey everyone! I open-sourced something I've been working on called Lullabeast. It's an autonomous dev pipeline. You describe your project and planner, executor, and reviewer agents build it phase by phase against a real git repo. How it came to be: for the last year or so I've…

9
TechCrunch — AI news-outlet 1d ago

Amazon launches new $1 billion FDE org, following OpenAI and Anthropic

Engineers on the new team will embed within companies to deploy purpose-built agents, focusing on fast deployments and customer self-sufficiency.

28
r/LocalLLaMA community 1d ago

I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy

Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one. So I finally benchmarked it properly. Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run with the full…

23
Hugging Face Daily Papers research 1d ago

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

Abstract: Agentic abstention involves determining when an AI agent should cease interaction under uncertainty, requiring sequential decision-making across multiple environments and task types. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) LLM agents are expected to act over…

13
TechCrunch — AI news-outlet 1d ago

Crypto exchange OKX wants AI agents to hire and pay each other

OKX is bringing together payments, identity and reputation into a marketplace for AI agents.

29
Hugging Face Daily Papers research 2d ago

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Abstract: TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents. Generated by…

4
Smol AI News news-outlet 2d ago

not much happened today

Anthropic launched Claude Sonnet 5 as its new default mid-tier frontier model, featuring a 1M-token context window, enhanced agentic capabilities including planning, browser and terminal tool use, and autonomous execution previously requiring larger models. The model…

27
arXiv — Machine Learning research 2d ago

On the Necessity of a Liquid Substrate for Mesh Intelligence

arXiv:2606.28413v1 Announce Type: new Abstract: A mesh of sovereign agents has no center: no shared clock, no shared model, and no coordinator to gather data or retrain. Its competence rests on each agent folding the projections its peers emit into a single internal state,…

8
arXiv — Machine Learning research 2d ago

An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations

arXiv:2606.28467v1 Announce Type: new Abstract: Appliance-level energy monitoring in office buildings produces noisy alerts that non-expert facility managers struggle to use. This paper proposes an end-to-end agentic pipeline that combines deep time-series forecasting,…

11
arXiv — Machine Learning research 2d ago

Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization

arXiv:2606.28764v1 Announce Type: new Abstract: Hierarchical decision-making frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals. Despite their promise, existing hierarchical policies face critical…

20
arXiv — Machine Learning research 2d ago

The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables

arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification…

38
arXiv — Machine Learning research 2d ago

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived…

16
arXiv — Machine Learning research 2d ago

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

arXiv:2606.28955v1 Announce Type: new Abstract: Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain…

10
