Tag: Benchmark
500 articles archived under #benchmark

arXiv — NLP / Computation & Language research 14d ago Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text arXiv:2606.18471v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic… 11

arXiv — NLP / Computation & Language research 14d ago LegalWorld: A Life-Cycle Interactive Environment for Legal Agents arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators… 37

arXiv — NLP / Computation & Language research 14d ago RedactionBench arXiv:2606.18782v1 Announce Type: new Abstract: Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction… 22

arXiv — NLP / Computation & Language research 14d ago G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is… 6

arXiv — NLP / Computation & Language research 14d ago ForecastBench-Sim: A Simulated-World Forecasting Benchmark arXiv:2606.18686v1 Announce Type: cross Abstract: Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce… 30

arXiv — NLP / Computation & Language research 14d ago IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge… 35

arXiv — NLP / Computation & Language research 14d ago ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution… 38

arXiv — NLP / Computation & Language research 14d ago FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on… 35

Hugging Face Daily Papers research 14d ago IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical… 24

Hugging Face Daily Papers research 14d ago Physics-IQ Verified Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video… 29

Hugging Face Daily Papers research 14d ago Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory… 23

Hugging Face official-blog 14d ago Is it agentic enough? Benchmarking open models on your own tooling Published June 18, 2026. Lysandre (lysandre), Nathan Habib (SaylorTwift), Pedro Cuenca (pcuenq). Benchmarking transformers revisions across different metrics. This is a… 26

r/MachineLearning community 14d ago How do you analyze the relative "strength" of probes? [R] This question is related to topics like language+ models (including multimodal) and things like "circuit" analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I'm trying to orient to the SoTA. I found this old post on trying… 21

arXiv — NLP / Computation & Language research 15d ago LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks arXiv:2606.17579v1 Announce Type: cross Abstract: Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input… 22

arXiv — NLP / Computation & Language research 15d ago Translating the Untranslatable: An Operationalizable Ontology for Untranslatability arXiv:2606.17354v1 Announce Type: new Abstract: Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations… 16

arXiv — NLP / Computation & Language research 15d ago NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned,… 12

arXiv — NLP / Computation & Language research 15d ago The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the… 7

arXiv — NLP / Computation & Language research 15d ago ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions arXiv:2606.17905v1 Announce Type: new Abstract: Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests… 10

arXiv — NLP / Computation & Language research 15d ago ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale… 36

arXiv — NLP / Computation & Language research 15d ago SpeechDx: A Multi-Task Benchmark for Clinical Speech AI arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated… 15

arXiv — NLP / Computation & Language research 15d ago PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with… 17

arXiv — NLP / Computation & Language research 15d ago EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.… 24

arXiv — NLP / Computation & Language research 15d ago Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically… 33

arXiv — NLP / Computation & Language research 15d ago Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,… 21

arXiv — NLP / Computation & Language research 15d ago The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act arXiv:2606.18158v1 Announce Type: cross Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the… 38

arXiv — NLP / Computation & Language research 15d ago EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38

Hugging Face Daily Papers research 15d ago ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions Abstract ChLogic benchmark reveals persistent performance gaps between English and Chinese logical reasoning in large language models, influenced by surface realization differences and translation artifacts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models… 37

Hugging Face Daily Papers research 15d ago ProCUA-SFT Technical Report Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training computer-use agents… 4

Hugging Face Daily Papers research 15d ago Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and… 19

OpenAI official-blog 15d ago Introducing LifeSciBench Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions. 19

r/LocalLLaMA community 16d ago bartowski/command-a-plus-05-2026-GGUF · Hugging Face Try with latest llama.cpp version. Share your t/s benchmarks & feedback submitted by /u/pmttyji 6

r/MachineLearning community 16d ago I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D] Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled? The setup: compile a human demo into an object-centric graph (what changed in the world:… 7

NVIDIA Developer Blog official-blog 16d ago NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium.... 17

Hugging Face Daily Papers research 16d ago MVEB: Massive Video Embedding Benchmark Abstract A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset… 7

The Information — AI news-outlet 16d ago Index Startup Ornn Launches Anthropic, OpenAI Token Benchmarks Ornn, a startup that tracks the cost of computing power for artificial intelligence, has launched a service to track the price of tokens produced by the leading AI labs. The new benchmark comes as AI firms’ customers and financial backers search for better ways to track major AI… 9

Hugging Face Daily Papers research 16d ago Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long… 28

Hugging Face Daily Papers research 16d ago PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions Abstract PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.… 13

arXiv — Machine Learning research 16d ago Benchmarking Instance-Dependent Label Noise with Controlled Corruptions arXiv:2606.14965v1 Announce Type: new Abstract: Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the… 21

arXiv — Machine Learning research 16d ago Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,… 23

arXiv — Machine Learning research 16d ago EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data… 9

arXiv — Machine Learning research 16d ago Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and… 28

arXiv — NLP / Computation & Language research 16d ago Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models arXiv:2606.15044v1 Announce Type: new Abstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that… 35

arXiv — NLP / Computation & Language research 16d ago CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction arXiv:2606.15069v1 Announce Type: new Abstract: Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that the… 20

arXiv — NLP / Computation & Language research 16d ago Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation arXiv:2606.15152v1 Announce Type: new Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal… 10

arXiv — NLP / Computation & Language research 16d ago Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus arXiv:2606.15345v1 Announce Type: new Abstract: Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and… 21

arXiv — NLP / Computation & Language research 16d ago EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management arXiv:2606.15532v1 Announce Type: new Abstract: Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only… 26

arXiv — NLP / Computation & Language research 16d ago Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation… 28

arXiv — NLP / Computation & Language research 16d ago EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries arXiv:2606.15735v1 Announce Type: new Abstract: Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making.… 26

arXiv — NLP / Computation & Language research 16d ago Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations arXiv:2606.15903v1 Announce Type: new Abstract: Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes… 21

arXiv — NLP / Computation & Language research 16d ago FinBalance: A Multi-Document Accounting Reconciliation Benchmark arXiv:2606.15949v1 Announce Type: new Abstract: Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a… 32

Page 6 of 10 · 500 articles