Tag: Benchmark · 500 articles archived under #benchmark arXiv — NLP / Computation & Language research 22d ago Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs arXiv:2606.10852v1 Announce Type: new Abstract: LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements, rather, they arise… 16 arXiv — NLP / Computation & Language research 22d ago T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains arXiv:2606.11070v1 Announce Type: new Abstract: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain… 15 arXiv — NLP / Computation & Language research 22d ago VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to… 14 arXiv — NLP / Computation & Language research 22d ago PhantomBench: Benchmarking the Non-existential Threat of Language Models arXiv:2606.11105v1 Announce Type: new Abstract: Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them.
This is particularly concerning in high-stakes domains, where consequences of such… 8 arXiv — NLP / Computation & Language research 22d ago $\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems arXiv:2606.10156v1 Announce Type: cross Abstract: As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce… 11 arXiv — NLP / Computation & Language research 22d ago RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning arXiv:2606.10254v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined.… 19 arXiv — NLP / Computation & Language research 22d ago Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations arXiv:2606.10281v1 Announce Type: cross Abstract: This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four… 12 arXiv — NLP / Computation & Language research 22d ago Advancing the State-of-the-Art in Empirical Privacy Auditing arXiv:2606.10481v1 Announce Type: cross Abstract: Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples.
Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on… 23 Hugging Face Daily Papers research 22d ago One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA Abstract Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks. Generated by… 28 Hugging Face Daily Papers research 22d ago BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts Abstract Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 14 Hugging Face official-blog 23d ago Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech Enterprise Article. Published June 9, 2026. Shama Gupta (shamagupta, ServiceNow-AI), Lindsay Brin (lindsaybrin, ServiceNow-AI), Fanny Riols (FannyRiols, ServiceNow-AI)… 11 Hugging Face Daily Papers research 23d ago Agents' Last Exam Abstract Agents' Last Exam (ALE) is a benchmark for evaluating AI agents on long-term, economically valuable real-world tasks across 13 industry clusters with 1K+ tasks, revealing significant gaps between benchmark performance and practical deployment. Generated by… 6 r/LocalLLaMA community 23d ago Text-to-Speech (TTS) Benchmark Revamped with Objective Standards and Blind Voting (46 models and counting) Thank you to everyone who contributed to my previous post, providing feedback and various models to add, and questioning the rating system.
You can now participate in a live blind voting to create a proper ELO for all the models that are added. Each new model that we add will… 23 r/LocalLLaMA community 23d ago Jetson Orin NX Build for Hermes Agent + Benchmarking I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days. I figured now with MoE and smaller models doing well, it was time to mess with it again. Goal: As silent as possible… 34 Hugging Face Daily Papers research 23d ago OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning Abstract OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning. Generated by… 5 Hugging Face Daily Papers research 23d ago Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops Abstract Researchers identify widespread vulnerabilities in agent benchmark verification systems and develop an automated iterative process using LLM agents to create robust verifiers that resist exploitation while maintaining legitimate task performance. Generated by… 20 Latent.Space news-outlet 23d ago [AINews] FrontierCode: Benchmarking for Code Quality over Slop We made a thing! 31 Hugging Face Daily Papers research 23d ago PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems Abstract A local benchmark-generation pipeline transforms live property graphs and seed queries into balanced NL-to-Cypher datasets for enterprise knowledge graphs, incorporating schema profiling, reverse-query grounding, and execution validation. Generated by… 22 r/LocalLLaMA community 23d ago Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants?
I'm trying to find out if anyone has done any benchmarking comparing the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants. I know QAT is supposed to retain a ton of accuracy compared to the baseline BF16, but I'm curious how a 4-bit QAT model actually… 37 Hugging Face Daily Papers research 23d ago OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics Abstract OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 18 r/LocalLLaMA community 23d ago Gemma 4 26B A4B IT QAT Comparison Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me. I did not use any AI other than asking Gemini 3.1 Pro if it was statistically significant because I was too tired to do… 31 arXiv — Machine Learning research 23d ago Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark arXiv:2606.07550v1 Announce Type: new Abstract: Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction… 35 arXiv — Machine Learning research 23d ago ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research arXiv:2606.07591v1 Announce Type: new Abstract: AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. 
We present ResearchClawBench, a benchmark for evaluating autonomous scientific research… 14 arXiv — Machine Learning research 23d ago LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training arXiv:2606.07610v1 Announce Type: new Abstract: State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful… 6 arXiv — Machine Learning research 23d ago Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models arXiv:2606.07623v1 Announce Type: new Abstract: This paper develops a model-theoretic framework for verifying context-conditioned language-model behavior by replacing benchmark labels with finite semantic certificates. The first problem is finite determinacy: when do examples in… 25 arXiv — Machine Learning research 23d ago Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no… 13 arXiv — Machine Learning research 23d ago A Framework for Evaluating and Benchmarking Concept Drift Detection Methods arXiv:2606.07789v1 Announce Type: new Abstract: Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. 
Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent… 26 Hugging Face Daily Papers research 23d ago Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key… 20 Hugging Face Daily Papers research 23d ago CoVEBench: Can Video Editing Models Handle Complex Instructions? Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by… 19 Hugging Face Daily Papers research 23d ago SWE-Explore: Benchmarking How Coding Agents Explore Repositories Abstract SWE-Explore introduces a benchmark for evaluating coding agents' repository exploration capabilities by requiring ranked lists of relevant code regions within line budgets, demonstrating that agentic exploration outperforms traditional retrieval methods. Generated by… 11 Hugging Face Daily Papers research 23d ago SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a… 7 r/LocalLLaMA community 23d ago I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. 
Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The… 14 r/LocalLLaMA community 24d ago Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by… 9 r/LocalLLaMA community 24d ago LocalLLaMA post tier list Since there is much (justified) whining about post quality, I thought it would be helpful to get a sense of what people actually DO like. Here's my take: S-tier: -GGUFs/MLX or benchmark data for new best-in-class local model released - New Optimizations that are actually a big… 17 r/LocalLLaMA community 24d ago When every other post is an AI generated benchmark report, a question about the best model, or a slop-coded application or engine that pretends to be groundbreaking (submitted by /u/Honest-Kangaroo-1830) 12 Hugging Face Daily Papers research 24d ago UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs Abstract UnpredictaBench evaluates large language models' capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We… 7 r/LocalLLaMA community 24d ago [Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below.
I spent the last week benchmarking DFlash speculative decoding combined with KV cache… 20 Hugging Face Daily Papers research 24d ago GENEB: Why Genomic Models Are Hard to Compare Abstract GENEB presents a comprehensive benchmark for evaluating genomic foundation models across diverse tasks and architectures under a unified protocol. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in genomic foundation models is difficult to assess due to fragmented… 25 Hugging Face Daily Papers research 24d ago SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated… 30 Smol AI News news-outlet 24d ago not much happened today **FrontierCode** benchmark by **Cognition** highlights the challenge of coding tasks with the best model, **Opus 4.8**, scoring only about **13%** on the hardest subset, indicating coding is less solved than benchmarks suggest. The trend toward using **loops** as a control… 5 Hugging Face Daily Papers research 24d ago MMAE: A Massive Multitask Audio Editing Benchmark Abstract MMAE presents a comprehensive benchmark for instruction-based audio editing across multiple modalities and complexity levels, revealing significant gaps in current model capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce MMAE, a Massive Multitask… 24 arXiv — Machine Learning research 24d ago Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. 
Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly… 27 arXiv — Machine Learning research 24d ago MacArena: Benchmarking Computer Use Agents on an Online macOS Environment arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,… 37 arXiv — Machine Learning research 24d ago ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets arXiv:2606.06717v1 Announce Type: new Abstract: While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets,… 32 arXiv — Machine Learning research 24d ago GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting arXiv:2606.06881v1 Announce Type: new Abstract: Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and… 38 arXiv — Machine Learning research 24d ago The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. 
In this study, we benchmark five sub-1B models (135M-1B)… 17 arXiv — Machine Learning research 24d ago REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or… 12 arXiv — Machine Learning research 24d ago Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation arXiv:2606.07387v1 Announce Type: new Abstract: State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose… 15 arXiv — Machine Learning research 24d ago CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations arXiv:2606.07488v1 Announce Type: new Abstract: Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training… 28 arXiv — Machine Learning research 24d ago Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction arXiv:2606.06509v1 Announce Type: cross Abstract: Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically… 17