Tag

Benchmark

500 articles archived under #benchmark

arXiv — NLP / Computation & Language research 16d ago

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning…

30
arXiv — NLP / Computation & Language research 16d ago

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains…

23
arXiv — NLP / Computation & Language research 16d ago

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

arXiv:2606.16011v1 Announce Type: new Abstract: Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a…

28
arXiv — NLP / Computation & Language research 16d ago

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasingly central role of LLMs in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We…

33
arXiv — NLP / Computation & Language research 16d ago

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can…

15
arXiv — NLP / Computation & Language research 16d ago

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However,…

36
Hugging Face Daily Papers research 16d ago

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract: VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) This technical…

16
Hugging Face Daily Papers research 16d ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Abstract: VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Vision language models are serving as…

32
r/LocalLLaMA community 16d ago

HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)

Link to last post Before anything else, I'd like to sincerely thank u/jipok_ for helping out by highlighting a few weak questions, categories and scoring issues, which have now been addressed (Dropping >100 questions, tuning the scoring methodology for more accuracy, etc).…

19
r/LocalLLaMA community 17d ago

Evalatro: an open benchmark where LLMs play the real Balatro

Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game. It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics. Then the idea grew into something…

21
r/LocalLLaMA community 17d ago

I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table

So every time I pick a model for a feature or some random use case, I end up with like 12 tabs open: usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or…

34
arXiv — Machine Learning research 17d ago

Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability

arXiv:2606.14245v1 Announce Type: new Abstract: Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability…

33
arXiv — Machine Learning research 17d ago

Can Deep Neural Networks Improve Compression of Very Large Scientific Data?

arXiv:2606.14353v1 Announce Type: new Abstract: Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. Most state-of-the-art compressors follow a…

36
arXiv — Machine Learning research 17d ago

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications…

5
arXiv — Machine Learning research 17d ago

EM-NeSy: Expectation Maximization for Neurosymbolic Learning

arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating…

38
arXiv — NLP / Computation & Language research 17d ago

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks…

29
arXiv — NLP / Computation & Language research 17d ago

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce…

25
arXiv — NLP / Computation & Language research 17d ago

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this…

10
arXiv — NLP / Computation & Language research 17d ago

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior…

15
arXiv — NLP / Computation & Language research 17d ago

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

arXiv:2606.14459v1 Announce Type: new Abstract: Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech…

6
arXiv — NLP / Computation & Language research 17d ago

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

arXiv:2606.14574v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type…

8
arXiv — NLP / Computation & Language research 17d ago

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

arXiv:2606.14600v1 Announce Type: new Abstract: Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce…

32
arXiv — NLP / Computation & Language research 17d ago

WorkBench Revisited: Workplace Agents Two Years On

arXiv:2606.13715v1 Announce Type: cross Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best…

34
arXiv — NLP / Computation & Language research 17d ago

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

arXiv:2606.13815v1 Announce Type: cross Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the…

37
arXiv — NLP / Computation & Language research 17d ago

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where…

4
arXiv — NLP / Computation & Language research 17d ago

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We…

23
arXiv — NLP / Computation & Language research 17d ago

FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

arXiv:2508.05782v2 Announce Type: replace Abstract: Large language models are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many natural language processing applications, such as dialogue systems. As a…

29
arXiv — NLP / Computation & Language research 17d ago

Residual Context Diffusion Language Models

arXiv:2601.22954v2 Announce Type: replace Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a…

7
arXiv — NLP / Computation & Language research 17d ago

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

arXiv:2603.05167v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce…

20
r/LocalLLaMA community 18d ago

Gemma 4 models benchmarked with Triple GPU

Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 MHz RAM. Nvidia GTX-1070 at 8 GiB VRAM (x3) with 24 GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo…

29
r/LocalLLaMA community 18d ago

Quality evaluation of quants with limited time or tokens

About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3…

36
r/LocalLLaMA community 18d ago

Dual DGX Sparks- 40tk/s single 1M ; 350 tk/s agg. - Deepseek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192)

First of all, shout out to Aiden/Antirez & the geniuses at the Nvidia community threads. I'm merely claude-vibing off of their work. That said, I thought I'd share recipes & learnings & benchmarks so far on running big MoE models on two DGX Sparks at a reasonable speed for agent…

14
r/LocalLLaMA community 19d ago

I don’t know who needs to hear this but 128GB BD-R XL M-DISC is SOTA for consumer-available archival optical storage (for backing up your models)

If you’re trying to download and preserve your local LLMs in case of future availability issues due to AI-related politics, your best bet is either 128GB or 100GB Blu-ray optical discs, more specifically the BD-R XL M-DISC standard format, which are archival-grade and built to last…

21
r/LocalLLaMA community 19d ago

GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X.

The model now supports a 1M context window and two thinking modes: max and high. z.ai recommends using max for coding. Vote on X What should we prioritize most? Longer context window MIT-licensed open weights No price increase Other links: GLM 5.2 announcement LLM Benchmark…

32
r/LocalLLaMA community 19d ago

Diffusion Gemma is 4x faster, but makes 6x more mistakes!

Benchmarked the new Gemma diffusion model against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS - every next topic less popular than the previous one. Then we…

14
NVIDIA Developer Blog official-blog 19d ago

NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark

AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how...

8
Hugging Face Daily Papers research 20d ago

The Cold-Start Safety Gap in LLM Agents

Abstract: Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Are tool-calling LLM agents equally safe…

37
Hugging Face Daily Papers research 20d ago

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Abstract: Parametric tool retrieval models show reduced performance and understanding when evaluated with realistic ambiguous queries compared to standard benchmarks, revealing a dissociation between knowledge retrieval and true tool comprehension. Generated by…

27
Smol AI News news-outlet 20d ago

not much happened today

**Anthropic** suspended access to **Claude Fable 5** and **Mythos 5** due to **US export controls**, sparking a debate on **model sovereignty** and geopolitical risks for frontier AI vendors. **Artificial Analysis** updated its coding agent benchmark, replacing **SWE-Bench Pro**…

17
arXiv — NLP / Computation & Language research 20d ago

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

arXiv:2606.12608v1 Announce Type: new Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping…

8
arXiv — NLP / Computation & Language research 20d ago

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

arXiv:2606.12789v1 Announce Type: new Abstract: Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present…

22
arXiv — NLP / Computation & Language research 20d ago

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global…

19
arXiv — NLP / Computation & Language research 20d ago

Polar: A Benchmark for Evaluating Political Bias in LLMs

arXiv:2606.12922v1 Announce Type: new Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that…

28
arXiv — NLP / Computation & Language research 20d ago

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most…

23
arXiv — NLP / Computation & Language research 20d ago

MÖVE: A Holistic LLM Benchmark for the German Public Sector

arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public…

30
arXiv — NLP / Computation & Language research 20d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to…

26
arXiv — NLP / Computation & Language research 20d ago

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

arXiv:2606.13647v1 Announce Type: new Abstract: We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4× the depth of existing multilingual…

25
arXiv — NLP / Computation & Language research 20d ago

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to…

30
arXiv — NLP / Computation & Language research 20d ago

SupraBench: A Benchmark for Supramolecular Chemistry

arXiv:2606.13477v1 Announce Type: cross Abstract: Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per…

38
Hugging Face Daily Papers research 20d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract: EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Search…

26
