TACO: Tool-Augmented Credit Optimization for Agentic Tool Use
Authors: Mingkuan Feng, Jinyang Wu, Hao Gu, Fangrui Lv, Ruihan Jin, Chuyuan Zhang, Zhengqi Wen, Jianhua Tao
Published: 2026-06-29 (arXiv:2606.30251)
Abstract
Tool-Augmented Credit Optimization (TACO) improves multimodal agent performance by distinguishing useful, redundant, or misleading code operations through dual advantage channels: Differential Answer-Probe Reward for individual tool contribution and Outcome-Gated Advantage Routing for final outcome distribution.
Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, a code operation can be useful, redundant, or misleading. Outcome-only rewards cannot distinguish these cases, and existing process rewards either fail to attribute final correctness to individual tool calls or require an external judge model. To address this, we introduce Tool-Augmented Credit Optimization (TACO), a GRPO variant for code-tool agents built on two coupled advantage channels. The first, Differential Answer-Probe Reward (DAPR), is a self-supervised, judge-free tool-contribution advantage that credits each tool call by its own effect on answering correctly. Probe tokens inserted into the model's reasoning elicit its predictions with and without the tool, and the difference in outcome reward is taken as the call's value: positive for a useful call, negative for a misleading one, and zero for one that changes nothing. Because DAPR reuses the existing answer checker, it needs no auxiliary judge; because it is a difference rather than an absolute probe score, it is naturally robust to probe-hacking. The second channel is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR): a parameter-free rule that, conditioned on the call's outcome, delivers this credit only to the responsible segments, suppressing wasted tool calls without any cost term. We train TACO through a two-stage SFT+RL pipeline. Extensive experiments across perception, reasoning, and general multimodal benchmarks show that TACO yields consistent accuracy gains and learns to invoke its tools only when they help.
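The differential probe idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `exact_match` and `probe_differences`, the exact-match checker, and the assumption that probe answers arrive as plain strings are all hypothetical, chosen only to show how a "with tool minus without tool" reward difference gives a signed per-call credit.

```python
from typing import Callable, List, Sequence


def exact_match(prediction: str, reference: str) -> float:
    """Hypothetical answer checker: 1.0 if the normalized strings match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def probe_differences(
    answers_without: Sequence[str],
    answers_with: Sequence[str],
    reference: str,
    checker: Callable[[str, str], float] = exact_match,
) -> List[float]:
    """Return one signed credit per tool call.

    answers_without[i] is the probed answer just before tool call i,
    answers_with[i] is the probed answer just after it (both assumed inputs).
    The credit is the difference in outcome reward between the two probes.
    """
    if len(answers_without) != len(answers_with):
        raise ValueError("need one 'without' and one 'with' probe per tool call")
    return [
        checker(after, reference) - checker(before, reference)
        for before, after in zip(answers_without, answers_with)
    ]


if __name__ == "__main__":
    credits = probe_differences(
        answers_without=["cat", "dog", "dog"],
        answers_with=["dog", "cat", "dog"],
        reference="dog",
    )
    print(credits)  # [1.0, -1.0, 0.0]
```

In this toy run the first call turns a wrong probe answer into a right one (credit 1.0), the second turns a right answer into a wrong one (credit -1.0), and the third changes nothing (credit 0.0), matching the useful, misleading, and redundant cases named in the abstract. How these credits are normalized and combined with the outcome advantage is not specified here and is left out of the sketch.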