Tag

Benchmark

500 articles archived under #benchmark · RSS

arXiv — NLP / Computation & Language research 22d ago

Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

arXiv:2606.10852v1 Announce Type: new Abstract: LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise…

16
arXiv — NLP / Computation & Language research 22d ago

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

arXiv:2606.11070v1 Announce Type: new Abstract: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain…

15
arXiv — NLP / Computation & Language research 22d ago

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to…

14
arXiv — NLP / Computation & Language research 22d ago

PhantomBench: Benchmarking the Non-existential Threat of Language Models

arXiv:2606.11105v1 Announce Type: new Abstract: Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such…

8
arXiv — NLP / Computation & Language research 22d ago

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

arXiv:2606.10156v1 Announce Type: cross Abstract: As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce…

11
arXiv — NLP / Computation & Language research 22d ago

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv:2606.10254v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined.…

19
arXiv — NLP / Computation & Language research 22d ago

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

arXiv:2606.10281v1 Announce Type: cross Abstract: This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four…

12
arXiv — NLP / Computation & Language research 22d ago

Advancing the State-of-the-Art in Empirical Privacy Auditing

arXiv:2606.10481v1 Announce Type: cross Abstract: Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on…

23
Hugging Face Daily Papers research 22d ago

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Abstract Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks. Generated by…

28
Hugging Face Daily Papers research 22d ago

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

Abstract Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

14
Hugging Face official-blog 23d ago

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Enterprise article published June 9, 2026. Authors: Shama Gupta, Lindsay Brin, Fanny Riols (ServiceNow-AI)…

11
Hugging Face Daily Papers research 23d ago

Agents' Last Exam

Abstract Agents' Last Exam (ALE) is a benchmark for evaluating AI agents on long-term, economically valuable real-world tasks across 13 industry clusters with 1K+ tasks, revealing significant gaps between benchmark performance and practical deployment. Generated by…

6
r/LocalLLaMA community 23d ago

Text-to-Speech (TTS) Benchmark Revamped with Objective Standards and Blind Voting (46 models and counting)

Thank you to everyone who contributed to my previous post, providing feedback and various models to add, and questioning the rating system. You can now participate in a live blind voting to create a proper ELO for all the models that are added. Each new model that we add will…

23
r/LocalLLaMA community 23d ago

Jetson Orin NX Build for Hermes Agent + Benchmarking

I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long-dead robotics project, from back in the Llama-7B days. I figured now with MoE and smaller models doing well, it was time to mess with it again. Goal: As silent as possible…

34
Hugging Face Daily Papers research 23d ago

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Abstract OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning. Generated by…

5
Hugging Face Daily Papers research 23d ago

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Abstract Researchers identify widespread vulnerabilities in agent benchmark verification systems and develop an automated iterative process using LLM agents to create robust verifiers that resist exploitation while maintaining legitimate task performance. Generated by…

20
Latent.Space news-outlet 23d ago

[AINews] FrontierCode: Benchmarking for Code Quality over Slop

We made a thing!

31
Hugging Face Daily Papers research 23d ago

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Abstract A local benchmark-generation pipeline transforms live property graphs and seed queries into balanced NL-to-Cypher datasets for enterprise knowledge graphs, incorporating schema profiling, reverse-query grounding, and execution validation. Generated by…

22
r/LocalLLaMA community 23d ago

Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants?

I'm trying to find out if anyone has done any benchmarking comparing the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants. I know QAT is supposed to retain a ton of accuracy compared to the baseline BF16, but I'm curious how a 4-bit QAT model actually…

37
Hugging Face Daily Papers research 23d ago

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Abstract OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

18
r/LocalLLaMA community 23d ago

Gemma 4 26B A4B IT QAT Comparison

Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me. I did not use any AI other than asking Gemini 3.1 Pro if it was statistically significant because I was too tired to do…

31
arXiv — Machine Learning research 23d ago

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

arXiv:2606.07550v1 Announce Type: new Abstract: Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction…

35
arXiv — Machine Learning research 23d ago

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

arXiv:2606.07591v1 Announce Type: new Abstract: AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research…

14
arXiv — Machine Learning research 23d ago

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

arXiv:2606.07610v1 Announce Type: new Abstract: State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful…

6
arXiv — Machine Learning research 23d ago

Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models

arXiv:2606.07623v1 Announce Type: new Abstract: This paper develops a model-theoretic framework for verifying context-conditioned language-model behavior by replacing benchmark labels with finite semantic certificates. The first problem is finite determinacy: when do examples in…

25
arXiv — Machine Learning research 23d ago

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no…

13
arXiv — Machine Learning research 23d ago

A Framework for Evaluating and Benchmarking Concept Drift Detection Methods

arXiv:2606.07789v1 Announce Type: new Abstract: Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent…

26
Hugging Face Daily Papers research 23d ago

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key…

20
Hugging Face Daily Papers research 23d ago

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by…

19
Hugging Face Daily Papers research 23d ago

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Abstract SWE-Explore introduces a benchmark for evaluating coding agents' repository exploration capabilities by requiring ranked lists of relevant code regions within line budgets, demonstrating that agentic exploration outperforms traditional retrieval methods. Generated by…

11
Hugging Face Daily Papers research 23d ago

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a…

7
r/LocalLLaMA community 23d ago

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The…

14
r/LocalLLaMA community 24d ago

Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by…

9
r/LocalLLaMA community 24d ago

LocalLLaMA post tier list

Since there is much (justified) whining about post quality, I thought it would be helpful to get a sense of what people actually DO like. Here's my take: S-tier: -GGUFs/MLX or benchmark data for new best-in-class local model released - New Optimizations that are actually a big…

17
r/LocalLLaMA community 24d ago

When every other post is an AI generated benchmark report, a question about the best model, or a slop-coded application or engine that pretends to be groundbreaking

submitted by /u/Honest-Kangaroo-1830

12
Hugging Face Daily Papers research 24d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract UnpredictaBench evaluates large language models' capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

7
r/LocalLLaMA community 24d ago

[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup

Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative decoding combined with KV cache…

20
Hugging Face Daily Papers research 24d ago

GENEB: Why Genomic Models Are Hard to Compare

Abstract GENEB presents a comprehensive benchmark for evaluating genomic foundation models across diverse tasks and architectures under a unified protocol. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in genomic foundation models is difficult to assess due to fragmented…

25
Hugging Face Daily Papers research 24d ago

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated…

30
Smol AI News news-outlet 24d ago

not much happened today

**FrontierCode** benchmark by **Cognition** highlights the challenge of coding tasks with the best model, **Opus 4.8**, scoring only about **13%** on the hardest subset, indicating coding is less solved than benchmarks suggest. The trend toward using **loops** as a control…

5
Hugging Face Daily Papers research 24d ago

MMAE: A Massive Multitask Audio Editing Benchmark

Abstract MMAE presents a comprehensive benchmark for instruction-based audio editing across multiple modalities and complexity levels, revealing significant gaps in current model capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce MMAE, a Massive Multitask…

24
arXiv — Machine Learning research 24d ago

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly…

27
arXiv — Machine Learning research 24d ago

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,…

37
arXiv — Machine Learning research 24d ago

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

arXiv:2606.06717v1 Announce Type: new Abstract: While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets,…

32
arXiv — Machine Learning research 24d ago

GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

arXiv:2606.06881v1 Announce Type: new Abstract: Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and…

38
arXiv — Machine Learning research 24d ago

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B)…

17
arXiv — Machine Learning research 24d ago

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or…

12
arXiv — Machine Learning research 24d ago

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

arXiv:2606.07387v1 Announce Type: new Abstract: State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose…

15
arXiv — Machine Learning research 24d ago

CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations

arXiv:2606.07488v1 Announce Type: new Abstract: Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training…

28
arXiv — Machine Learning research 24d ago

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

arXiv:2606.06509v1 Announce Type: cross Abstract: Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically…

17
