Tag

Agents + tool use

500 articles archived under #agents

arXiv — Machine Learning research 2h ago

Play Like Champions: Counterfactual Feedback Generation in Latent Space

arXiv:2607.00190v1 Announce Type: new Abstract: Recent advances in reinforcement learning have produced superhuman agents across a wide range of competitive games. As a byproduct, researchers have begun studying how these agents play, extracting behavioral representations,…

37
arXiv — NLP / Computation & Language research 2h ago

EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems

arXiv:2607.00297v1 Announce Type: cross Abstract: When LLM agents use evaluator feedback to adapt their behavior in closed loops, evaluator biases propagate through the agent's strategy distribution -- a phenomenon known as evaluator preference coupling. Prior work has…

37
arXiv — Machine Learning research 2h ago

Distributed Online Bandit Submodular Maximization with Bounded Sampling Violations

arXiv:2607.00680v1 Announce Type: new Abstract: We study distributed online submodular maximization under partition matroid constraints, in which multiple agents select a limited number of actions from their own subsets sequentially to maximize the cumulative value of a sequence…

30
arXiv — Machine Learning research 2h ago

Task-Relevant Representation Decoupling for Visual Reinforcement Learning Generalization

arXiv:2607.00796v1 Announce Type: new Abstract: Visual Reinforcement Learning (VRL) has achieved considerable success in solving control tasks. However, generalizing learned policies to new environments remains a major challenge, as agents often overfit to task-irrelevant…

32
arXiv — Machine Learning research 2h ago

Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos

arXiv:2607.00808v1 Announce Type: new Abstract: Pre-training on large-scale videos to improve reinforcement learning efficiency is promising yet remains challenging. Existing methods typically treat the agent as an indivisible entity, modeling motion patterns globally. Such…

8
arXiv — NLP / Computation & Language research 2h ago

TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data

arXiv:2607.00339v1 Announce Type: new Abstract: Conversational data is increasingly used as a persistent source of user state for long-running assistants and AI agents. However, querying this data remains challenging because conversations naturally evolve: plans are revised,…

8
arXiv — NLP / Computation & Language research 2h ago

A Task-State Representation for Long-Horizon Mobile GUI Agents

arXiv:2607.00502v1 Announce Type: new Abstract: While long-horizon mobile GUI agents typically rely on thought-action-observation loops, they struggle to separate persistent task states from transient screen observations. As execution histories grow, this entanglement imposes a…

11
arXiv — NLP / Computation & Language research 2h ago

Multi-Turn Agentic Scientific Literature Search via Workflow Induction

arXiv:2607.00597v1 Announce Type: new Abstract: Scientific literature search often requires more than retrieving papers from a single query: users' intents are underspecified, preference-dependent, and evolve through interaction. Existing search agents typically rely on fixed…

26
arXiv — NLP / Computation & Language research 2h ago

Behavior-Adaptive Conversational Agents: Toward a Fluid Personality Framework

arXiv:2607.01034v1 Announce Type: new Abstract: Large language model (LLM)-based conversational agents (CAs) are now ubiquitous, creating new opportunities for AI-mediated behavior change. Their capacity to project nuanced personalities and adopt diverse metaphorical roles…

38
arXiv — NLP / Computation & Language research 2h ago

Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

arXiv:2607.01047v1 Announce Type: new Abstract: Complexity and interpretability rarely coincide: systems rich enough for complex behaviours to emerge are usually too opaque to question, while transparent ones are too simple for anything complex to emerge. A single large language…

33
arXiv — NLP / Computation & Language research 2h ago

Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory

arXiv:2607.00017v1 Announce Type: cross Abstract: Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user. Existing memory-augmented LLM agents have made progress in building…

30
arXiv — NLP / Computation & Language research 2h ago

From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents

arXiv:2607.00233v1 Announce Type: cross Abstract: How do two agents invent a shared language from scratch? In a Lewis signaling game, a sender and receiver must coordinate on a code using only their interaction history. We study five memory architectures across varying channel…

26
arXiv — NLP / Computation & Language research 2h ago

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

arXiv:2607.00394v1 Announce Type: cross Abstract: LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement…

22
arXiv — NLP / Computation & Language research 2h ago

Self-Evolving Agents with Anytime-Valid Certificates

arXiv:2607.00871v1 Announce Type: cross Abstract: Self-evolving agents violate the assumption behind most learning-theoretic guarantees: the data, evaluator, components, and hypothesis space are produced by the policy being updated. We present SEA, an architecture that…

27
arXiv — NLP / Computation & Language research 2h ago

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

arXiv:2607.01061v1 Announce Type: cross Abstract: Computer-assisted synthesis planning breaks target molecules into accessible precursors using large libraries of reaction rules that assign each transformation a deterministic, interpretable label. But chemistry is long-tailed,…

17
arXiv — NLP / Computation & Language research 2h ago

OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

arXiv:2510.24636v3 Announce Type: replace Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and…

34
arXiv — NLP / Computation & Language research 2h ago

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,…

22
r/MachineLearning community 2h ago

SentryCode: Real-time Auditor + Honeytokens for AI Coding Agents [P]

In light of recent privacy concerns arising from local AI coding agents performing telemetry, environmental scanning, and hidden cue fingerprinting, I've open-sourced SentryCode—a kernel-level behavior auditing tool. It logs file/network/cue activity, uses honeypot tokens for…

12
r/LocalLLaMA community 5h ago

I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3

Just wanted to share that I was looking for an optimal way to run Ornith 35B in FP8 with E4M3 and MTP with vLLM but there was no out-of-the-box model with MTP drafter support. So I grafted this new model! It's 18% faster than without MTP and the drafter acceptance rate is not…

31
Latent.Space news-outlet 6h ago

Autoresearch: The feedback loop behind self-improving agents

Introspection co-founder Roland Gavrilescu explains autoresearch, agent “recipes,” self-improving loops, and why humans remain central to the software factory.

15
r/LocalLLaMA community 9h ago

ZCode: New Agentic Code Editor from the Makers of GLM

submitted by /u/johnnyApplePRNG

16
Latent.Space news-outlet 11h ago

How Cursor deploys AI inside the enterprise

Cursor's Pauline Brunet explains how her team of Forward Deployed Engineers help organizations implement agents — essentially setting up software factories.

35
r/LocalLLaMA community 11h ago

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

Some backstory: I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.…

16
Hugging Face Daily Papers research 11h ago

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

Abstract TRIAGE introduces a role-typed credit assignment framework that enhances agentic reinforcement learning by providing more nuanced credit assignment than standard GRPO methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic reinforcement learning requires assigning…

26
TechCrunch — AI news-outlet 12h ago

Cloudflare’s new policy pushes AI companies to pay for publishers’ content

Cloudflare is giving AI companies until September 15 to separate web crawlers used for search from those used for AI training and agents, or risk being blocked by default on many publisher sites.

16
Hugging Face Daily Papers research 12h ago

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

Abstract SWE-Interact presents a testbed that evaluates coding agents in realistic multi-turn, user-driven software engineering scenarios, revealing significant gaps between single-turn performance and interactive task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

6
r/LocalLLaMA community 13h ago

Plurality Released: fully Free and Open Source AI agents/chatbot platform for local AI

Hello everyone! Some of you might recognize my user from the work I have done on Cosmos Cloud, but today I am here to talk to you about an entirely different project: Plurality. https://github.com/azukaar/plurality Plurality has been in development for a bit more than a year and…

22
NVIDIA Developer Blog official-blog 13h ago

Mastering Agentic Techniques: AI Agent Reinforcement Learning

Reinforcement learning (RL) is central to aligning language models, from reinforcement learning with human feedback (RLHF) within AI assistants to newer...

38
Hugging Face Daily Papers research 13h ago

Hierarchical Experimentalist Agents

Abstract HExA enables large language models to improve through active experimentation and skill learning in novel domains without requiring training or external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models (LLMs) are increasingly used to take…

24
Hugging Face Daily Papers research 14h ago

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Abstract Act2Answer protocol evaluates embodied vision-language-action models by having agents answer questions through physical actions, revealing knowledge retention and generalization patterns across different semantic categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Hugging Face Daily Papers research 14h ago

Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents

Abstract Grounded word learning experiments using visual embeddings and lexical learners reveal that perceptual distance, rather than semantic relatedness, determines acquisition success, with distinct patterns in naming and retrieval performance. Generated by…

34
r/LocalLLaMA community 14h ago

Open Models - June 2026

After an overwhelming April and an OK May, here's June. Yeah, the graph has fewer items, because we got other items here last month. Finetunes: Nex-N2 Ornith-1.0 Agents-A1 Holo3.1 Tmax-27b MusaCoder-27B VibeThinker-3B NVFP4 from NVIDIA for below models:…

8
TechCrunch — AI news-outlet 15h ago

Gemini Spark, Google’s agentic assistant, is now available on Mac

Google's 24/7 agentic assistant, Gemini Spark, comes to Mac alongside other improvements, like real-time tracking and support for more apps.

35
r/LocalLLaMA community 16h ago

Agent execution visualizer

I've seen projects which stream tool use status and subagent generation, and represented it with a nice little visual based on the tool being used, etc. It would be pretty cool to pair this with some live model visualisations like a QKV heatmap across attention heads. Not for…

28
Hugging Face Daily Papers research 17h ago

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Abstract A testbed called QVal is introduced for evaluating dense supervision signals in long-horizon LLM agent tasks by measuring how well method scores align with Q-values, enabling fair comparison of different supervision approaches without training. Generated by…

22
r/LocalLLaMA community 18h ago

Hister: Give Your AI Assistant a Private Memory

I have been working on Hister, a self hosted search engine that automatically indexes pages you visit, local files, and documentation, then keeps them searchable with stored offline previews. It also exposes an MCP endpoint, so local AI assistants can search your own indexed…

5
Hugging Face Daily Papers research 19h ago

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

Abstract Procedural memory enhances LLM agents on workplace tasks through skill transfer across roles and models, with varying generalization capabilities affecting deployment strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Procedural memory is increasingly used to…

22
r/MachineLearning community 20h ago

A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P]

Prompt injection has emerged as one of the most persistent failure modes in tool-using LLM systems, particularly in agentic workflows where models interact with external data sources. Most mitigation strategies focus on input filtering or model-side alignment, but these…

9
Hugging Face Daily Papers research 20h ago

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

Abstract SkillHone enables continuous evolution of agent skills by maintaining persistent decision histories and incorporating practice feedback for improved performance across research and tool-mediated analysis tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agent skills…

35
Hugging Face Daily Papers research 20h ago

DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation

Abstract DataEvolver is a self-evolving multi-agent framework that improves text-rich image generation by leveraging feedback from rejected samples to iteratively enhance data quality. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-rich image generation is one of the most…

11
Hugging Face Daily Papers research 21h ago

Xiaomi-GUI-0 Technical Report

Abstract A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface (GUI) agents build on…

7
r/LocalLLaMA community 23h ago

Ketch - Best Search Tool for local models

Recently I wrote a blog post to find which search tool would be best for the pi coding agent paired with local models (currently I use Qwen3.6 35B). Before that I was using firecrawl or brave-search, but found them very decent, so I went to SearXNG, which is fine, but lacks some…

38
Latent.Space news-outlet 1d ago

AIEWF Daily Dispatch: Loops, Software Factories & Forward Deployed Engineers

On Tuesday at the AI Engineer World's Fair, there was a lot of talk about loops, agent engineering, and the emergence of software factories. Also a hot topic: open models.

34
arXiv — Machine Learning research 1d ago

PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely…

25
arXiv — Machine Learning research 1d ago

Expected Gain-based Escalation in Vertical Federated Learning

arXiv:2606.31331v1 Announce Type: new Abstract: Collaborative inference can improve predictive performance by integrating complementary information across agents, but applying collaborative fusion to every sample can incur unnecessary communication and computational overhead.…

17
arXiv — NLP / Computation & Language research 1d ago

Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?

arXiv:2606.31371v1 Announce Type: cross Abstract: When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribution - a phenomenon termed evaluator preference coupling.…

38
arXiv — Machine Learning research 1d ago

ECHO: Prune to act, trace to learn with selective turn memory in agentic RL

arXiv:2606.31650v1 Announce Type: new Abstract: Long-horizon language agents must repeatedly interact with tools, accumulate evidence, and make decisions under bounded context windows. Existing context-management methods make such rollouts feasible by truncating distant history,…

12
arXiv — Machine Learning research 1d ago

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

arXiv:2606.32017v1 Announce Type: new Abstract: Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform…

13
arXiv — NLP / Computation & Language research 1d ago

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

arXiv:2606.32034v1 Announce Type: cross Abstract: LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the…

36
arXiv — NLP / Computation & Language research 1d ago

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

arXiv:2606.30775v1 Announce Type: new Abstract: Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term…

25
