Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas
Abstract: We propose an LLM harness that generates code-based policy functions for multi-agent environments, evaluates them with self-play, and refines them using feedback from previous iterations. Following the recent line of work in feedback engineering (the design of which information signals are shown to the LLM during refinement), we compare sparse feedback (scalar reward only) with dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). In two Sequential Social Dilemmas (Gathering and Cleanup) and with two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback improves over or matches sparse feedback on all metrics. We explain this asymmetry via feedback aliasing: when the scalar reward maps distinct failure modes into the same value (e.g., under- vs. over-cleaning), social metrics disambiguate and allow the LLM to diagnose which direction of improvement to take. We conclude that social metrics act as a coordination signal, leading to strategies such as Voronoi territory partitioning and adaptive cleaner schedules.
Code at this https URL.
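As a rough illustration of feedback aliasing, the sketch below builds a sparse and a dense feedback string for two toy episodes. It is not the authors' harness. The function names, the choice of mean per-agent return as the scalar reward, and the definition of equality as one minus the Gini coefficient are all assumptions made for this example. Sustainability and peace are omitted because they would need per-step data. The toy failure mode (one agent hoarding all reward) also differs from the under- vs. over-cleaning case named in the abstract.

```python
from typing import List


def mean_return(returns: List[float]) -> float:
    """Scalar reward, assumed here to be the mean per-agent return."""
    return sum(returns) / len(returns)


def equality(returns: List[float]) -> float:
    """Assumed definition: 1 - Gini coefficient over non-negative returns."""
    n = len(returns)
    total = sum(returns)
    if total == 0:
        return 1.0  # all agents earned the same (zero) return
    pairwise = sum(abs(a - b) for a in returns for b in returns)
    return 1.0 - pairwise / (2 * n * total)


def build_feedback(returns: List[float], dense: bool) -> str:
    """Hypothetical helper: text shown to the LLM during refinement."""
    lines = [f"reward: {mean_return(returns):.2f}"]
    if dense:
        # Dense feedback appends a social metric to the scalar reward.
        lines.append(f"equality: {equality(returns):.2f}")
    return "\n".join(lines)


# Two toy episodes with the same mean return but different outcomes.
shared = [10.0, 10.0, 10.0, 10.0]   # reward spread evenly
hoarded = [40.0, 0.0, 0.0, 0.0]     # one agent collects everything

sparse_shared = build_feedback(shared, dense=False)
sparse_hoarded = build_feedback(hoarded, dense=False)
dense_shared = build_feedback(shared, dense=True)
dense_hoarded = build_feedback(hoarded, dense=True)

print(sparse_shared == sparse_hoarded)  # the scalar alone cannot tell them apart
print(dense_shared)
print(dense_hoarded)
```

Under these assumptions the two sparse strings are identical while the dense strings differ, which is the disambiguation the abstract attributes to social metrics.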
| Comments: | Accepted to NExT-Game 2026: New Frontiers in Game-Theoretic Learning, ICML 2026 Workshop. Camera-ready version |
| Subjects: | Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT) |
| Cite as: | arXiv:2603.19453 [cs.CL] |
| | (or arXiv:2603.19453v3 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2603.19453 (arXiv-issued DOI via DataCite) |
Submission history
From: Victor Gallego
[v1] Thu, 19 Mar 2026 20:27:48 UTC (64 KB)
[v2] Mon, 1 Jun 2026 16:03:40 UTC (25 KB)
[v3] Tue, 30 Jun 2026 09:14:49 UTC (38 KB)