DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.
Computer Science > Machine Learning
Title:DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
Abstract:Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framework that represents models and evaluation items in a shared space, jointly estimating model ability together with item difficulty and sharpness. We apply DualEval across four domains: coding, math, miscellaneous domain-knowledge tasks, and generic everyday user queries. Our evaluation uses 18 frontier LLMs, static benchmark labels, and reward-model scores validated against held-out human preferences for open-ended model responses. Empirically, our framework produces reliable and balanced model rankings, and its learned item-level profiles support downstream applications such as benchmark compression for sample-efficient evaluation and anomaly detection for contamination or outlier analysis. Overall, DualEval unifies static and arena-style evaluation through joint model-item calibration, producing model rankings and item-level diagnostics that support more sample-efficient, interpretable, and auditable evaluation pipelines.
| Subjects: | Machine Learning (cs.LG); Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.26429 [cs.LG] |
| (or arXiv:2606.26429v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2606.26429
arXiv-issued DOI via DataCite (pending registration)
|
Access Paper:
- View PDF
- HTML (experimental)
- TeX Source
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — Machine Learning
-
Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol
Jul 2
-
SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling
Jul 2
-
SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification
Jul 2
-
A Filtered Mixture-of-Generators for Fully Synthetic Survival Training
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.