Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.29985

Published on Jun 29 · Submitted by Lee SangMook on Jul 1
Authors: Sangmook Lee, Minbeom Kim, Jeonghye Kim, Dohyung Kim, Sojeong Rhee, Kyomin Jung

Abstract

Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches.

Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
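To make the surface- versus approach-level distinction concrete, here is a minimal sketch. It is not the paper's metric or judge framework: `surface_diversity` uses mean pairwise Jaccard distance over word sets as a stand-in for a surface-level measure, and `approach_diversity` assumes each correct solution already carries a strategy label (a hypothetical output of some judge). All function names, labels, and example solutions are illustrative assumptions.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def surface_diversity(solutions):
    """Mean pairwise Jaccard distance over lowercase word sets.

    A stand-in surface-level metric (assumption), sensitive to wording only.
    """
    token_sets = [set(s.lower().split()) for s in solutions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def approach_diversity(labels):
    """Fraction of solution pairs whose strategy labels differ.

    Labels are hypothetical; in practice they would come from a judge.
    """
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Three correct solutions to "sum the integers 1..100", all using the same
# closed-form strategy but phrased very differently (made-up examples).
solutions = [
    "Use the formula n(n+1)/2 with n = 100 to get 5050",
    "Applying Gauss's closed form, one hundred times one hundred one halved gives 5050",
    "By the arithmetic series identity the total equals 5050",
]
labels = ["closed_form", "closed_form", "closed_form"]

print(f"surface:  {surface_diversity(solutions):.2f}")
print(f"approach: {approach_diversity(labels):.2f}")
```

In this toy set the wording-based score is high while the label-based score is zero, which is the kind of mismatch the abstract describes: a surface metric can report diversity even when every solution follows one strategy.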

Community

Lee SangMook (paper author, paper submitter), Jul 1:

When LLMs appear to give diverse math solutions, are they truly exploring different strategies—or merely rephrasing the same one? We address this question through approach-level diversity, which captures whether solutions differ in how they solve the problem, not just in how they are written.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

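The following papers were recommended by the Semantic Scholar API:

* Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation (2026), huggingface.co/papers/2605.28022
* Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning (2026), huggingface.co/papers/2605.09292
* Exploration-Driven Optimization for Test-Time Large Language Model Reasoning (2026), huggingface.co/papers/2605.09853
* Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning (2026), huggingface.co/papers/2605.11461
* Vector Policy Optimization: Training for Diversity Improves Test-Time Search (2026), huggingface.co/papers/2605.22817
* ExpRL: Exploratory RL for LLM Mid-Training (2026), huggingface.co/papers/2606.17024
* Strategy-Aware Optimization Modeling with Reasoning LLMs (2026), huggingface.co/papers/2605.02545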

