Hugging Face Daily Papers · 7 min read

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.32017

Published on Jun 30, 2026 · Submitted by Hejian Sang on Jul 1, 2026
Authors: Yuanda Xu, Zhengze Zhou, Hejian Sang, Xiaomin Li, Jiaxin Zhang, Xinchen Du, Zhipeng Wang, Alborz Geramifard

Abstract

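TRIAGE is a role-typed credit assignment framework for agentic reinforcement learning. Instead of applying the final verifier outcome uniformly to every action, as standard GRPO does, it labels each trajectory segment with a semantic role and adjusts that segment's credit accordingly.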

Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone -- a projection of the per-segment advantage residual onto the role variable -- so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO.
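The role-conditioned rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four role names come from the abstract, but the numeric constants in `ROLE_REWARD`, the `Segment` type, and the choice to add the role term to the outcome advantage are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # The four segment roles named in the abstract.
    DECISIVE_PROGRESS = "decisive_progress"
    USEFUL_EXPLORATION = "useful_exploration"
    NO_PROGRESS_INFRA = "no_progress_infrastructure"
    REGRESSION = "regression"


# Hypothetical constants: the abstract only states that the rule is fixed
# and that the resulting process rewards are bounded.
ROLE_REWARD = {
    Role.DECISIVE_PROGRESS: 0.5,
    Role.USEFUL_EXPLORATION: 0.25,
    Role.NO_PROGRESS_INFRA: 0.0,
    Role.REGRESSION: -0.5,
}


@dataclass
class Segment:
    role: Role        # label assigned by the judge (assumed to be given)
    num_tokens: int   # number of action tokens in the segment


def segment_credit(outcome_advantage: float, segments: list[Segment]) -> list[float]:
    """One credit value per segment: outcome advantage plus a role term.

    The additive combination is an assumption for illustration.
    """
    return [outcome_advantage + ROLE_REWARD[s.role] for s in segments]


def token_credit(outcome_advantage: float, segments: list[Segment]) -> list[float]:
    """Expand segment-level credit so every token in a segment shares it."""
    credits: list[float] = []
    for seg, value in zip(segments, segment_credit(outcome_advantage, segments)):
        credits.extend([value] * seg.num_tokens)
    return credits


# A failed rollout (outcome advantage -1.0): exploration is penalized less
# than regression instead of both receiving the same -1.0.
failed = [Segment(Role.USEFUL_EXPLORATION, 3), Segment(Role.REGRESSION, 2)]
print(segment_credit(-1.0, failed))  # [-0.75, -1.5]
```

With these assumed constants, the outcome still sets the overall direction of the update, while the role term separates segments that outcome-only credit would treat identically.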

Community

Paper submitter about 7 hours ago

TRIAGE (Role-Typed Credit Assignment for Agentic Reinforcement Learning) is a framework designed to improve credit assignment in agentic RL by adding a semantic "role" axis to trajectory-level rewards. It addresses the "blind spots" of standard Group Relative Policy Optimization (GRPO), which uniformly rewards or punishes all actions in a trajectory based solely on the final outcome.
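To make the blind spot concrete, here is a small sketch of outcome-only credit in the GRPO style: each rollout's reward is normalized against its group, and the resulting scalar is copied to every action token. The function names are hypothetical, and details such as the use of population standard deviation and the `eps` term are assumptions; implementations differ on these.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage: float, num_action_tokens: int) -> list[float]:
    """Outcome-only credit: every action token gets the same advantage."""
    return [advantage] * num_action_tokens


# Four rollouts for the same task; 1.0 means the verifier accepted the result.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)

# In the first (successful) rollout, all 5 action tokens share one value,
# so a redundant or regressive step is reinforced as much as a decisive one.
tokens_success = broadcast_to_tokens(advantages[0], 5)

# In the second (failed) rollout, useful exploration is punished like
# everything else.
tokens_failure = broadcast_to_tokens(advantages[1], 5)
```

Nothing inside a trajectory distinguishes one action from another here, which is the gap the role axis is meant to fill.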

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

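The following papers were recommended by the Semantic Scholar API:

* StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning (2026), https://huggingface.co/papers/2605.27140
* Semantic Consistency Policy Optimization for Reinforcement Learning of LLM Agents (2026), https://huggingface.co/papers/2606.25852
* Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents (2026), https://huggingface.co/papers/2606.12634
* Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers (2026), https://huggingface.co/papers/2605.04984
* What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents (2026), https://huggingface.co/papers/2605.19447
* SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning (2026), https://huggingface.co/papers/2605.18299
* TACO: Tool-Augmented Credit Optimization for Agentic Tool Use (2026), https://huggingface.co/papers/2606.30251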
