DOPD: Dual On-policy Distillation
Authors: Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang, Shuai Dong, Kaiwen Tuo, Xiangyu Zeng, Kaituo Feng, Qunzhong Wang, Yang Shi, Xiaobin Hu, Xiangyu Yue, Jiaqi Wang, Shuicheng Yan
Published: 2026-06-29
Abstract
DOPD addresses privilege illusion in on-policy distillation by dynamically routing token-level supervision between teacher and student policies based on advantage gaps and probabilities, improving capability transfer in large and vision-language models.
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby raise the performance frontier of distillation, an intuitive direction is to inject privileged information into either the teacher or the student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close with the information-asymmetry gap that can only be mimicked, never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of a different strength, objective, and strategy from either the teacher or the student itself, so that credible capability is transferred alongside auxiliary signals, which alleviates privilege illusion. Extensive experiments in both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
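The abstract does not spell out the routing rule, so the following is only an illustrative sketch of what per-token routing between two supervision sources could look like. Everything in it is an assumption made for illustration rather than the paper's method: the function names `route_tokens` and `distill_loss`, the `gap_threshold` parameter, the use of precomputed per-token advantages as inputs, the `|gap| / (1 + |gap|)` weighting, and the weighted log-probability difference used as the loss.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenSignal:
    source: str         # "teacher" or "student" (the privileged student)
    weight: float       # supervision strength in [0, 1)
    target_logp: float  # log-prob the chosen source assigns to the sampled token


def route_tokens(
    teacher_logp: List[float],
    priv_student_logp: List[float],
    teacher_adv: List[float],
    priv_student_adv: List[float],
    gap_threshold: float = 0.0,
) -> List[TokenSignal]:
    """Pick one supervision source per token of a student-sampled trajectory.

    Hypothetical rule (an assumption, not taken from the paper): supervise
    with the privileged teacher when its advantage exceeds the privileged
    student's by more than `gap_threshold`; otherwise use the privileged
    student. The weight grows with the size of the advantage gap.
    """
    n = len(teacher_logp)
    if not (len(priv_student_logp) == len(teacher_adv) == len(priv_student_adv) == n):
        raise ValueError("all per-token inputs must have the same length")

    signals = []
    for t in range(n):
        gap = teacher_adv[t] - priv_student_adv[t]
        if gap > gap_threshold:
            source, target = "teacher", teacher_logp[t]
        else:
            source, target = "student", priv_student_logp[t]
        # Squash the gap magnitude into [0, 1) so no single token dominates.
        weight = abs(gap) / (1.0 + abs(gap))
        signals.append(TokenSignal(source, weight, target))
    return signals


def distill_loss(student_logp: List[float], signals: List[TokenSignal]) -> float:
    """Mean of w_t * (log p_student(y_t) - log p_source(y_t)) over sampled tokens.

    This is a weighted single-sample estimate of a reverse-KL-style signal on
    the tokens the student actually sampled; the weighting is an assumption.
    """
    if len(student_logp) != len(signals):
        raise ValueError("student_logp and signals must have the same length")
    total = sum(s.weight * (lp - s.target_logp) for lp, s in zip(student_logp, signals))
    return total / max(len(signals), 1)
```

In this sketch a token with zero advantage gap gets zero weight, which mirrors the abstract's point that only a small subset of tokens carries capability-bearing signal. A real implementation would operate on batched tensors and would need the paper's actual definitions of the advantage gap and relative probabilities.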
arXiv: arxiv.org/abs/2606.30626