Benchmark · 500 articles archived under #benchmark
arXiv — NLP / Computation & Language research 16d ago A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning… 30 arXiv — NLP / Computation & Language research 16d ago Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains… 23 arXiv — NLP / Computation & Language research 16d ago Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs arXiv:2606.16011v1 Announce Type: new Abstract: Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a… 28 arXiv — NLP / Computation & Language research 16d ago AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. 
We… 33 arXiv — NLP / Computation & Language research 16d ago GRACE: Step-Level Benchmark for Faithful Reasoning over Context arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can… 15 arXiv — NLP / Computation & Language research 16d ago Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However,… 36 Hugging Face Daily Papers research 16d ago VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models Abstract VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This technical… 16 Hugging Face Daily Papers research 16d ago VisualClaw: A Real-Time, Personalized Agent for the Physical World Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as… 32 r/LocalLLaMA community 16d ago HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...) 
Link to last post Before anything else, I'd like to sincerely thank u/jipok_ for helping out by highlighting a few weak questions, categories and scoring issues, which have now been addressed (dropping >100 questions, tuning the scoring methodology for more accuracy, etc.).… 19 r/LocalLLaMA community 17d ago Evalatro: an open benchmark where LLMs play the real Balatro Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game. It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics. Then the idea grew into something… 21 r/LocalLLaMA community 17d ago I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table So every time I pick a model for a feature or random use-case I have, I end up having like 12 tabs open: usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or… 34 arXiv — Machine Learning research 17d ago Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability arXiv:2606.14245v1 Announce Type: new Abstract: Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability… 33 arXiv — Machine Learning research 17d ago Can Deep Neural Networks Improve Compression of Very Large Scientific Data? arXiv:2606.14353v1 Announce Type: new Abstract: Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. 
Most state-of-the-art-compressors follow a… 36 arXiv — Machine Learning research 17d ago Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications… 5 arXiv — Machine Learning research 17d ago EM-NeSy: Expectation Maximization for Neurosymbolic Learning arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating… 38 arXiv — NLP / Computation & Language research 17d ago The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks… 29 arXiv — NLP / Computation & Language research 17d ago Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce… 25 arXiv — NLP / Computation & Language research 17d ago Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. 
Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this… 10 arXiv — NLP / Computation & Language research 17d ago Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior… 15 arXiv — NLP / Computation & Language research 17d ago MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition arXiv:2606.14459v1 Announce Type: new Abstract: Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech… 6 arXiv — NLP / Computation & Language research 17d ago SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model arXiv:2606.14574v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type… 8 arXiv — NLP / Computation & Language research 17d ago LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations arXiv:2606.14600v1 Announce Type: new Abstract: Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. 
We introduce… 32 arXiv — NLP / Computation & Language research 17d ago WorkBench Revisited: Workplace Agents Two Years On arXiv:2606.13715v1 Announce Type: cross Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best… 34 arXiv — NLP / Computation & Language research 17d ago Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs arXiv:2606.13815v1 Announce Type: cross Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the… 37 arXiv — NLP / Computation & Language research 17d ago ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where… 4 arXiv — NLP / Computation & Language research 17d ago MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. 
We… 23 arXiv — NLP / Computation & Language research 17d ago FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification arXiv:2508.05782v2 Announce Type: replace Abstract: Large language models are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many natural language processing applications, such as dialogue systems. As a… 29 arXiv — NLP / Computation & Language research 17d ago Residual Context Diffusion Language Models arXiv:2601.22954v2 Announce Type: replace Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a… 7 arXiv — NLP / Computation & Language research 17d ago C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning arXiv:2603.05167v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce… 20 r/LocalLLaMA community 18d ago Gemma 4 models benchmarked on with Triple GPU Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 MHz RAM. Nvidia GTX-1070 at 8 GiB VRAM (x3) with 24 GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo… 29 r/LocalLLaMA community 18d ago Quality evaluation of quants with limited time or tokens About a year ago, people were publishing a lot of benchmarks about various quants of models. 
I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other hand, it may still be useful to know locally whether q3… 36 r/LocalLLaMA community 18d ago Dual DGX Sparks- 40tk/s single 1M ; 350 tk/s agg. - Deepseek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192) First of all, shout out to Aiden/Antirez & geniuses at the Nvidia community threads. I'm merely claude-vibing off of their works. That said, I thought I'd share recipes & learnings & benchmarks so far on running big MOE models on two DGX Sparks at a reasonable speed for agent… 14 r/LocalLLaMA community 19d ago I don’t know who needs to hear this but 128GB BD-R XL M-DISC is SOTA for consumer-available archival optical storage (for backing up your models) If you’re trying to download and preserve your local LLMs in case of future availability issues due to AI-related politics, your best bet is either 128GB or 100GB Blu-ray optical discs, more specifically the BD-R XL M-DISC standard format, which are archival-grade and built to last… 21 r/LocalLLaMA community 19d ago GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X. The model now supports a 1M context window and two thinking modes: max and high. z.ai recommends using max for coding. Vote on X What should we prioritize most? Longer context window MIT-licensed open weights No price increase Other links: GLM 5.2 announcement LLM Benchmark… 32 r/LocalLLaMA community 19d ago Diffusion Gemma is 4x faster, but makes 6x more mistakes! Benchmarked the new Gemma diffusion model against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS - every next topic less popular than the previous one. 
Then we… 14 NVIDIA Developer Blog official-blog 19d ago NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how... 8 Hugging Face Daily Papers research 20d ago The Cold-Start Safety Gap in LLM Agents Abstract Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Are tool-calling LLM agents equally safe… 37 Hugging Face Daily Papers research 20d ago ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs Abstract Parametric tool retrieval models show reduced performance and understanding when evaluated with realistic ambiguous queries compared to standard benchmarks, revealing a dissociation between knowledge retrieval and true tool comprehension. Generated by… 27 Smol AI News news-outlet 20d ago not much happened today **Anthropic** suspended access to **Claude Fable 5** and **Mythos 5** due to **US export controls**, sparking a debate on **model sovereignty** and geopolitical risks for frontier AI vendors. **Artificial Analysis** updated its coding agent benchmark, replacing **SWE-Bench Pro**… 17 arXiv — NLP / Computation & Language research 20d ago Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants arXiv:2606.12608v1 Announce Type: new Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping… 8 arXiv — NLP / Computation & Language research 20d ago How Fine-Grained Should a RAG Benchmark Be? 
A Hierarchical Framework for Synthetic Question Generation arXiv:2606.12789v1 Announce Type: new Abstract: Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present… 22 arXiv — NLP / Computation & Language research 20d ago LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global… 19 arXiv — NLP / Computation & Language research 20d ago Polar: A Benchmark for Evaluating Political Bias in LLMs arXiv:2606.12922v1 Announce Type: new Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that… 28 arXiv — NLP / Computation & Language research 20d ago LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most… 23 arXiv — NLP / Computation & Language research 20d ago MÖVE: A Holistic LLM Benchmark for the German Public Sector arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. 
While LLMs are increasingly adopted in public… 30 arXiv — NLP / Computation & Language research 20d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to… 26 arXiv — NLP / Computation & Language research 20d ago SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation arXiv:2606.13647v1 Announce Type: new Abstract: We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4× the depth of existing multilingual… 25 arXiv — NLP / Computation & Language research 20d ago EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to… 30 arXiv — NLP / Computation & Language research 20d ago SupraBench: A Benchmark for Supramolecular Chemistry arXiv:2606.13477v1 Announce Type: cross Abstract: Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per… 38 Hugging Face Daily Papers research 20d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search… 26