Hugging Face Daily Papers · July 1, 2026 · 6 min read

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

When LLMs appear to give diverse math solutions, are they truly exploring different strategies, or merely rephrasing the same one? We address this question through approach-level diversity, which captures whether solutions differ in how they solve the problem, not just in how they are written.

arxiv:2606.29985

Published on Jun 29 · Submitted by Lee SangMook on Jul 1 · Seoul National University

Authors:

Sangmook Lee, Minbeom Kim, Jeonghye Kim, Dohyung Kim, Sojeong Rhee, Kyomin Jung

Abstract

Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
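
The gap the abstract describes can be illustrated with a toy example. The sketch below is not the paper's method: the function names are hypothetical, mean pairwise n-gram Jaccard distance stands in for a generic surface-level metric, and the approach labels are assigned by hand where the paper would use its human-calibrated LLM judge. It only shows how a set of solutions can score high on a surface measure while containing a single strategy.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Set of word n-grams from a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def surface_diversity(solutions, n=2):
    """Mean pairwise Jaccard distance between n-gram sets.

    An assumed stand-in for a surface-level diversity metric,
    not one of the metrics evaluated in the paper.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        ga, gb = ngrams(a, n), ngrams(b, n)
        union = ga | gb
        similarity = len(ga & gb) / len(union) if union else 1.0
        total += 1.0 - similarity
    return total / len(pairs)


def approach_diversity(labels):
    """Number of distinct approach labels among the solutions."""
    return len(set(labels))


# Three correct solutions to "sum the integers from 1 to 100".
solutions = [
    "Pair 1 with 100, 2 with 99, and so on: 50 pairs of 101 give 5050.",
    "Group the terms from both ends into fifty pairs, each adding to 101, hence 5050.",
    "Match smallest with largest repeatedly; every pair sums to 101 and there are 50, so 5050.",
]

# Hypothetical labels, assigned by hand for illustration only.
labels = ["pairing", "pairing", "pairing"]

print("surface diversity: ", surface_diversity(solutions))
print("approach diversity:", approach_diversity(labels))
```

The three solutions share almost no wording, so the surface score is high, yet every one of them uses the same pairing argument, so the approach count is one.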