Tag

Benchmark

500 articles archived under #benchmark

arXiv — NLP / Computation & Language research 14d ago

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

arXiv:2606.18471v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic…

11
arXiv — NLP / Computation & Language research 14d ago

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators…

37
arXiv — NLP / Computation & Language research 14d ago

RedactionBench

arXiv:2606.18782v1 Announce Type: new Abstract: Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction…

22
arXiv — NLP / Computation & Language research 14d ago

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is…

6
arXiv — NLP / Computation & Language research 14d ago

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

arXiv:2606.18686v1 Announce Type: cross Abstract: Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce…

30
arXiv — NLP / Computation & Language research 14d ago

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge…

35
arXiv — NLP / Computation & Language research 14d ago

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution…

38
arXiv — NLP / Computation & Language research 14d ago

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on…

35
Hugging Face Daily Papers research 14d ago

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical…

24
Hugging Face Daily Papers research 14d ago

Physics-IQ Verified

Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video…

29
Hugging Face Daily Papers research 14d ago

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory…

23
Hugging Face official-blog 14d ago

Is it agentic enough? Benchmarking open models on your own tooling

Published June 18, 2026. By Lysandre (lysandre), Nathan Habib (SaylorTwift), and Pedro Cuenca (pcuenq). Benchmarking transformers revisions across different metrics. This is a…

26
r/MachineLearning community 14d ago

How do you analyze the relative "strength" of probes? [R]

This question is related to topics like language+ models (including multimodal) and things like "circuit" analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I'm trying to orient to the SoTA. I found this old post on trying…

21
arXiv — NLP / Computation & Language research 15d ago

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

arXiv:2606.17579v1 Announce Type: cross Abstract: Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input…

22
arXiv — NLP / Computation & Language research 15d ago

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

arXiv:2606.17354v1 Announce Type: new Abstract: Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations…

16
arXiv — NLP / Computation & Language research 15d ago

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned,…

12
arXiv — NLP / Computation & Language research 15d ago

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the…

7
arXiv — NLP / Computation & Language research 15d ago

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

arXiv:2606.17905v1 Announce Type: new Abstract: Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests…

10
arXiv — NLP / Computation & Language research 15d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale…

36
arXiv — NLP / Computation & Language research 15d ago

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated…

15
arXiv — NLP / Computation & Language research 15d ago

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with…

17
arXiv — NLP / Computation & Language research 15d ago

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.…

24
arXiv — NLP / Computation & Language research 15d ago

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically…

33
arXiv — NLP / Computation & Language research 15d ago

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,…

21
arXiv — NLP / Computation & Language research 15d ago

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

arXiv:2606.18158v1 Announce Type: cross Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the…

38
arXiv — NLP / Computation & Language research 15d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
Hugging Face Daily Papers research 15d ago

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Abstract ChLogic benchmark reveals persistent performance gaps between English and Chinese logical reasoning in large language models, influenced by surface realization differences and translation artifacts. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Large language models…

37
Hugging Face Daily Papers research 15d ago

ProCUA-SFT Technical Report

Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Training computer-use agents…

4
Hugging Face Daily Papers research 15d ago

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and…

19
OpenAI official-blog 15d ago

Introducing LifeSciBench

Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.

19
r/LocalLLaMA community 16d ago

bartowski/command-a-plus-05-2026-GGUF · Hugging Face

Try with latest llama.cpp version. Share your t/s benchmarks & feedback. Submitted by /u/pmttyji

6
r/MachineLearning community 16d ago

I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]

Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled? The setup: compile a human demo into an object-centric graph (what changed in the world:…

7
NVIDIA Developer Blog official-blog 16d ago

NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance

NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium....

17
Hugging Face Daily Papers research 16d ago

MVEB: Massive Video Embedding Benchmark

Abstract A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset…

7
The Information — AI news-outlet 16d ago

Index Startup Ornn Launches Anthropic, OpenAI Token Benchmarks

Ornn, a startup that tracks the cost of computing power for artificial intelligence, has launched a service to track the price of tokens produced by the leading AI labs. The new benchmark comes as AI firms’ customers and financial backers search for better ways to track major AI…

9
Hugging Face Daily Papers research 16d ago

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Web agents act through long…

28
Hugging Face Daily Papers research 16d ago

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

Abstract PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.…

13
arXiv — Machine Learning research 16d ago

Benchmarking Instance-Dependent Label Noise with Controlled Corruptions

arXiv:2606.14965v1 Announce Type: new Abstract: Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the…

21
arXiv — Machine Learning research 16d ago

Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,…

23
arXiv — Machine Learning research 16d ago

EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction

arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data…

9
arXiv — Machine Learning research 16d ago

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and…

28
arXiv — NLP / Computation & Language research 16d ago

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

arXiv:2606.15044v1 Announce Type: new Abstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that…

35
arXiv — NLP / Computation & Language research 16d ago

CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

arXiv:2606.15069v1 Announce Type: new Abstract: Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that the…

20
arXiv — NLP / Computation & Language research 16d ago

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

arXiv:2606.15152v1 Announce Type: new Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal…

10
arXiv — NLP / Computation & Language research 16d ago

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

arXiv:2606.15345v1 Announce Type: new Abstract: Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and…

21
arXiv — NLP / Computation & Language research 16d ago

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

arXiv:2606.15532v1 Announce Type: new Abstract: Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only…

26
arXiv — NLP / Computation & Language research 16d ago

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation…

28
arXiv — NLP / Computation & Language research 16d ago

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

arXiv:2606.15735v1 Announce Type: new Abstract: Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making.…

26
arXiv — NLP / Computation & Language research 16d ago

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

arXiv:2606.15903v1 Announce Type: new Abstract: Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes…

21
arXiv — NLP / Computation & Language research 16d ago

FinBalance: A Multi-Document Accounting Reconciliation Benchmark

arXiv:2606.15949v1 Announce Type: new Abstract: Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a…

32
