Agents + tool use · 500 articles archived under #agents Hugging Face Daily Papers research 8d ago AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning Abstract Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language… 26 Latent.Space news-outlet 8d ago [AINews] Claude Tag: Multiplayer, Proactive, Persistent Agents in Slack Claude finally gets a Slackbot upgrade 8 Hugging Face Daily Papers research 8d ago OpenThoughts-Agent: Data Recipes for Agentic Models Abstract An open-source data curation pipeline for training agentic language models is presented, demonstrating superior performance through systematic experimentation and scalable training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic language models dramatically… 34 Hugging Face Daily Papers research 8d ago LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis Abstract A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Mental… 36 r/LocalLLaMA community 8d ago Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments Qwen just released Qwen-AgentWorld-35B-A3B — a 35B-parameter MoE with only ~3B active parameters per token. The interesting part: this is not positioned as a standard chat/instruction model or a full autonomous agent. 
It is a language world model trained to predict what an… 6 r/LocalLLaMA community 8d ago GitHub - QwenLM/Qwen-AgentWorld: Qwen-AgentWorld: Language World Models for General Agents submitted by /u/dan945 5 arXiv — Machine Learning research 8d ago Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets arXiv:2606.23961v1 Announce Type: new Abstract: Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same… 20 arXiv — NLP / Computation & Language research 8d ago When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents arXiv:2606.23937v1 Announce Type: new Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B… 11 arXiv — Machine Learning research 8d ago Critique of Agent Model arXiv:2606.23991v1 Announce Type: cross Abstract: What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "AI co-scientists", and other "agentic" tools that promise to drive up productivity, and at the same… 31 arXiv — Machine Learning research 8d ago Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines arXiv:2606.24598v1 Announce Type: cross Abstract: While expert-validated "LLM + script" workflows deliver significant value, they remain static: they encode hard-won domain knowledge yet fail to adapt execution based on feedback. 
Existing agent research predominantly targets… 22 arXiv — Machine Learning research 8d ago ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target… 29 arXiv — NLP / Computation & Language research 8d ago Metis: Bridging Text and Code Memory for Self-Evolving Agents arXiv:2606.24151v1 Announce Type: new Abstract: Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as… 38 arXiv — NLP / Computation & Language research 8d ago Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning arXiv:2606.24428v1 Announce Type: new Abstract: Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent… 17 arXiv — NLP / Computation & Language research 8d ago AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of… 36 arXiv — NLP / Computation & Language research 8d ago NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? 
arXiv:2606.24530v1 Announce Type: new Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real… 21 arXiv — NLP / Computation & Language research 8d ago MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery arXiv:2606.24595v1 Announce Type: new Abstract: Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through… 32 arXiv — NLP / Computation & Language research 8d ago Qwen-AgentWorld: Language World Models for General Agents arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can… 8 arXiv — NLP / Computation & Language research 8d ago Are We Ready For An Agent-Native Memory System? arXiv:2606.24775v1 Announce Type: new Abstract: Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic… 8 arXiv — NLP / Computation & Language research 8d ago Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce arXiv:2606.24783v1 Announce Type: new Abstract: Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. 
We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2)… 23 arXiv — NLP / Computation & Language research 8d ago SHERLOC: Structured Diagnostic Localization for Code Repair Agents arXiv:2606.24820v1 Announce Type: new Abstract: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval… 20 arXiv — NLP / Computation & Language research 8d ago Bayesian control for coding agents arXiv:2606.24453v1 Announce Type: cross Abstract: Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty.… 21 arXiv — NLP / Computation & Language research 8d ago CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially,… 20 Hugging Face Daily Papers research 8d ago Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning Abstract EDV is a three-stage framework that uses multiple heterogeneous agents to collaboratively construct reliable experiences for LLM agents, preventing self-confirmatory errors through execute-distill-verify processes. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 29 Hugging Face Daily Papers research 8d ago Qwen-AgentWorld: Language World Models for General Agents Abstract Language-based world models enable agentic environment simulation across multiple domains and enhance general agent performance through scalable simulation and improved downstream task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A world model predicts… 16 Hugging Face Daily Papers research 8d ago NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation… 21 Hugging Face Daily Papers research 8d ago ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection Abstract A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs. Generated by… 29 Hugging Face Daily Papers research 8d ago MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management Abstract MemGUI-Agent addresses long-horizon mobile GUI task limitations through proactive context management using Context-as-Action (ConAct) to maintain critical information across extended sequences. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents… 32 Hugging Face Daily Papers research 8d ago MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization Abstract MobileForge enables efficient adaptation of mobile GUI agents through annotation-free learning by combining real app interaction grounding with hierarchical feedback-guided policy optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents… 18 TechCrunch — AI news-outlet 8d ago India’s MoEngage bets that the future of marketing is millions of AI agents The all-cash deal gives MoEngage access to technology that assigns AI agents to individual customers. 17 r/LocalLLaMA community 8d ago Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000) For agentic work fast high context is king, OpenCode fills the window quickly and most models that feel snappy at 8k context turn into dial-up ADSL brrr by the time you're at 150k context deep. So I've been testing lots of models and runners trying to get "local Sonnet" on 2x… 14 r/LocalLLaMA community 8d ago MiniMax2.7 @47tg 1200pp MiniMax 2.7 REAP Q4 on 96GB VRAM and 192 GB DDR5 udimm ram on a b840 MSI board and 9900X cpu. 1250W PSU and all cards are power limited. Linux Ubuntu. Agent class model. Excellent instruction following and tool calling. I run this model in a round robin loop with 3 sequencing… 19 r/LocalLLaMA community 8d ago Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL) Hey everyone, wanted to share some work on making the new Tmax-27B terminal agent actually runnable on consumer hardware. What is Tmax-27B? Ai2 just released Tmax, a family of terminal-agent LLMs trained with DPPO (RL) on top of Qwen3.6. 
The 27B model hits ~43% on Terminal Bench… 32 Hugging Face Daily Papers research 8d ago Self-Compacting Language Model Agents Abstract SelfCompact is a scaffolding approach that enables models to autonomously determine optimal compaction timing and methods for managing long agent traces, achieving better performance with reduced token costs compared to fixed-interval methods. Generated by… 13 Hugging Face Daily Papers research 8d ago When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents Abstract Premature commitment in long-horizon LLM agents leads to silent failures where agents defend early interpretations without considering alternatives, and hidden-state convergence serves as an early diagnostic for trajectory consistency. Generated by… 24 Hugging Face Daily Papers research 9d ago Libretto: Giving LLM Agents a Sense of Musical Structure Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from… 18 Hugging Face Daily Papers research 9d ago Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? Abstract Computer-use agents frequently expose inappropriate information across applications, prompting the creation of AgentCIBench to evaluate and mitigate privacy risks in cross-application contexts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use agents (CUAs) now… 7 NVIDIA Developer Blog official-blog 9d ago Build an AI Scientist for Life Science Discovery with NVIDIA BioNeMo Agent Toolkit AI scientists are emerging as a new interface for scientific computing. These agents can read papers, write code, generate hypotheses, call APIs, inspect files,... 
12 TechCrunch — AI news-outlet 9d ago Fika Jobs raises $4M to build a video-first hiring platform where AI agents interview candidates Stockholm-based startup Fika Jobs is building a video-first hiring platform that combines AI interview agents with short-form video profiles, creating something that feels like a cross between LinkedIn and TikTok. 5 Hugging Face official-blog 9d ago Build real agentic apps using CUGA: two dozen working examples on a lightweight harness Enterprise Article Published June 23, 2026 Anupama Murthi anupamamurthi ibm-research Hamid Adebayo harmedox ibm-research Sami Marreed samimarreed… 30 Hugging Face Daily Papers research 9d ago Training Open Models for Agentic Phone Use Abstract PhoneBuddy combines real and mock app environments to improve training of open models for phone use, demonstrating enhanced task success rates through mixed reinforcement learning approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Phones are becoming an important… 11 r/LocalLLaMA community 9d ago Pooled round robin hardware with friends? I have a rig Friend1 has rig Friend2 has a rig Each rig idle 90% of the time With agentic, how could we round robin as a group? So when I load up, it checks if friends rigs are idle (vpn etc) and if idle farms out tasks. If I understand right, agents work in parallel, so this… 29 Hugging Face Daily Papers research 9d ago Counsel: A Meta-Evaluation Dataset for Agentic Tasks Abstract A large-scale dataset of human-metaevaluations of LLM critiques for agentic tasks is introduced to improve the calibration and reliability of automated evaluation methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As agentic systems tackle increasingly complex… 22 r/LocalLLaMA community 9d ago My local server idling 99% of the time! Guys what you running to make agents busy? 
Like some crazy 24/7 tasks, or maybe some useful ideas on how to utilize local llm with some purpose/use? I personally running Qwen3.6-27B with owu and with pi for coding (little-coder) but as in title - it’s idling all the time…  … 33 Hugging Face Daily Papers research 9d ago Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills Abstract Notes2Skills framework converts laboratory notes into verifiable skills for AI agents while maintaining author uncertainty levels, addressing gaps in scientific AI development. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Scientific discovery workflows usually contain… 27 Hugging Face Daily Papers research 9d ago SkillHarness: Harnessing Safe Skills for Computer-Use Agents Abstract SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-Use Agents (CUAs)… 24 Hugging Face Daily Papers research 9d ago AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction Abstract AOHP presents an Android-based operating system framework that treats AI agents as first-class entities, enhancing task completion rates and reducing execution costs through specialized agent-oriented mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI agents… 16 NVIDIA Developer Blog official-blog 9d ago How Telcos Build Autonomous Networks with Agentic AI Telecom operators are adopting AI across network operations, customer care, and back-office workflows, but most are still early in the journey to autonomy. In... 37 r/LocalLLaMA community 9d ago Training a Qwen 3.5 4B/9B agent for multi-tool use: SFT first or go directly to RL? To train Qwen 3.5 4B or 9B for a custom multi-tool agent workflow and would appreciate guidance from people who have done this successfully. A few questions: SFT → RL or RL-only? 
- Is it still recommended to first do supervised fine-tuning (tool-calling traces, reasoning… 15 Hugging Face Daily Papers research 9d ago DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured… 19 Smol AI News news-outlet 9d ago not much happened today **Prime Intellect's `prime-rl` v0.6.0** advances agentic reinforcement learning infrastructure supporting **1 trillion parameter MoE models** with sub-5-minute step times and a **131k context GLM-5 agentic setup**. The release includes optimizations in inference, training, and… 37