When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling
Authors: Yong Yi Bay, Kathleen A. Yearick
Published: 2026-06-27 · arXiv: 2606.28661 · Code: https://github.com/bay-yearick-lab/sampling-ceilings
Abstract
Sampling-based reasoning systems face a trade-off between coverage and selection, where additional samples beyond a few dozen provide diminishing returns and can degrade performance.
People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a correct answer turns up somewhere, so coverage, the fraction of problems with at least one correct try, climbs and appears to be progress. But a deployed system must return one answer, and choosing it, not knowing which try is right, is selection; selection is capped, and past a point extra samples only make the model surer of a confident mistake, even as every draw adds cost. The gap between climbing coverage and stalled selection, the identifiability gap, is the answer a model can produce but not pick. So the real question is not whether to sample but how far, and the answer is: not far. For picking an answer, the vote has already settled within a few dozen draws, the modal ceiling; for scoring a benchmark, sooner still, the correlation ceiling. Beyond that, extra draws cost compute and add nothing, and can even make the answer worse. This paper turns the cutoff into a single number, the effective number of samples, that any sampling run already reveals. The bottleneck is recognizing a right answer, not generating one.
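The split between climbing coverage and capped selection can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes a single problem with only two candidate answers, independent draws, and a plain majority vote over an odd number of draws, and the function names are hypothetical. Under those assumptions, when the correct answer is the minority answer (probability below 0.5), coverage rises toward 1 while majority-vote accuracy falls as draws are added.

```python
from math import comb


def coverage(p_correct: float, n: int) -> float:
    """Probability that at least one of n independent draws is correct."""
    return 1.0 - (1.0 - p_correct) ** n


def majority_vote_accuracy(p_correct: float, n: int) -> float:
    """Probability that the correct answer wins a majority vote over n draws.

    Toy assumptions (not from the paper): exactly two candidate answers,
    independent draws, and odd n so that ties cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    # Sum the binomial tail: the correct answer needs more than n/2 votes.
    return sum(
        comb(n, k) * p_correct**k * (1.0 - p_correct) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    p = 0.4  # illustrative value: the correct answer is NOT the mode
    for n in (1, 3, 11, 51):
        print(n, round(coverage(p, n), 4), round(majority_vote_accuracy(p, n), 4))
```

With `p` above 0.5 the same vote improves with more draws, so the toy model reproduces the asymmetry the abstract describes: extra samples sharpen whatever the mode already is, right or wrong.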
Community
When does the 10,001st sample stop helping? And can more sampling ever hurt? Reframing single-model sampling as cluster sampling answers both: effective draws saturate at a hard correlation ceiling 1/ρ (about 2 on released logs), and selection is capped by a modal ceiling π_mode that anti-scales where the mode is wrong. Coverage climbing while majority voting plateaus is an identifiability gap, not a compute limit. Measured on public logs, fully reproducible: https://github.com/bay-yearick-lab/sampling-ceilings
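One way to see why a ceiling of 1/ρ appears is the standard design-effect formula from cluster sampling, n_eff = n / (1 + (n − 1)ρ), which tends to 1/ρ as n grows. The sketch below assumes that textbook formula with equally correlated draws; the paper's exact estimator may differ, and the function name and the value ρ = 0.5 (chosen only because the comment reports 1/ρ of about 2) are illustrative.

```python
def effective_samples(n: int, rho: float) -> float:
    """Effective number of independent draws among n equicorrelated draws.

    Uses the textbook design-effect formula n / (1 + (n - 1) * rho).
    This is an assumed stand-in, not necessarily the paper's estimator.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    rho = 0.5  # illustrative: corresponds to a ceiling 1/rho of 2
    for n in (1, 2, 10, 100, 10_001):
        print(n, round(effective_samples(n, rho), 4))
```

Under this formula the effective count never reaches 1/ρ for positive ρ, so with ρ = 0.5 even the 10,001st draw leaves the run worth fewer than two independent samples.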