SDR: Set-Distance Rewards for Radiology Report Generation
Authors: Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert
Published: 2026-05-30 (arXiv:2606.00440)
Abstract
Set-based rewards using embedding distances improve chest X-ray report generation by enabling effective post-training and test-time selection without requiring causal reasoning structures.
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. For chest X-ray report generation, however, the standard rewards (i.e., exact-match accuracy and step-level process rewards) are ill-suited, because a report consists of unordered, orthogonal findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences, and each sentence is embedded by a frozen sentence transformer, yielding an unordered set of embeddings. We propose using set-to-set distances between the generated and reference embedding sets as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B, Qwen3-VL-4B, and Gemma3-4B), post-training with set-distance rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics, with average relative improvements of 6.80% on BERTScore, 7.82% on RadGraph F1, and 4.45% on CheXbert F1. The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, and GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, the distances support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-N selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
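The set-based view described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which set-to-set distance is used, so the symmetric Chamfer distance below is an assumption, and `embed` is a toy bag-of-words stand-in for the frozen sentence transformer so that the example runs with the standard library only. The function names (`set_distance_reward`, `best_of_n`) and the rule of scoring a candidate by its nearest training report are likewise hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(report):
    """Split a report into sentences on '.', '!' and '?' (a simplification)."""
    return [s.strip() for s in re.split(r"[.!?]+", report) if s.strip()]


def embed(sentence):
    """Toy stand-in for a frozen sentence transformer: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def chamfer_distance(gen_set, ref_set):
    """Symmetric Chamfer distance (assumed choice of set-to-set distance).

    Each element is matched to its nearest neighbour in the other set, so
    the value does not depend on the order of sentences in either report.
    """
    if not gen_set or not ref_set:
        return 1.0
    gen_to_ref = sum(min(cosine_distance(g, r) for r in ref_set)
                     for g in gen_set) / len(gen_set)
    ref_to_gen = sum(min(cosine_distance(r, g) for g in gen_set)
                     for r in ref_set) / len(ref_set)
    return 0.5 * (gen_to_ref + ref_to_gen)


def report_distance(report_a, report_b):
    """Set-to-set distance between the sentence embeddings of two reports."""
    set_a = [embed(s) for s in split_sentences(report_a)]
    set_b = [embed(s) for s in split_sentences(report_b)]
    return chamfer_distance(set_a, set_b)


def set_distance_reward(generated, reference):
    """Continuous reward: higher when the two sentence sets are closer."""
    return 1.0 - report_distance(generated, reference)


def best_of_n(candidates, train_reports):
    """Pick the candidate closest to any training report (assumed rule)."""
    return min(candidates,
               key=lambda c: min(report_distance(c, t) for t in train_reports))


reference = "Heart size is normal. No pleural effusion."
reordered = "No pleural effusion. Heart size is normal."
unrelated = "Large right pneumothorax. Multiple rib fractures."

print(set_distance_reward(reordered, reference))  # close to 1: order is ignored
print(set_distance_reward(unrelated, reference))  # much lower
print(best_of_n([unrelated, reordered], [reference]))
```

Because every sentence is matched to its nearest neighbour in the other set, reordering the findings leaves the reward unchanged, which is the permutation invariance the abstract refers to. In practice the toy `embed` would be replaced by a real sentence encoder.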