Tag: Benchmark
500 articles archived under #benchmark

arXiv — NLP / Computation & Language research 24d ago UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in… 33

arXiv — NLP / Computation & Language research 24d ago An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection arXiv:2606.06879v1 Announce Type: new Abstract: Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features… 12

arXiv — NLP / Computation & Language research 24d ago OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of… 5

arXiv — NLP / Computation & Language research 24d ago Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments arXiv:2606.06960v1 Announce Type: new Abstract: Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback.
We study a more challenging setting: low-repetition tasks with implicit… 12

arXiv — NLP / Computation & Language research 24d ago MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.… 19

arXiv — NLP / Computation & Language research 24d ago mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages? arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require… 19

arXiv — NLP / Computation & Language research 24d ago UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We… 37

arXiv — NLP / Computation & Language research 24d ago M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic… 19

arXiv — NLP / Computation & Language research 24d ago How reliable are LLMs when it comes to playing dice?
arXiv:2606.07515v1 Announce Type: new Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a… 33

arXiv — NLP / Computation & Language research 24d ago MMAE: A Massive Multitask Audio Editing Benchmark arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,… 8

arXiv — NLP / Computation & Language research 24d ago SWE-Explore: Benchmarking How Coding Agents Explore Repositories arXiv:2606.07297v1 Announce Type: cross Abstract: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved),… 10

arXiv — NLP / Computation & Language research 24d ago The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders? arXiv:2606.07435v1 Announce Type: cross Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI… 28

Vercel — AI dev-tools 24d ago DeepSeek enters the fight for token volume, Anthropic continues to dominate spend Every month, AI Gateway routes tens of trillions of tokens between production applications and AI labs, giving us visibility into what AI usage actually looks like, separate from leaderboards and benchmarks. We publish the data monthly in the AI Gateway production index.
May… 18

Hugging Face Daily Papers research 24d ago PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams Abstract PaperFlow is a framework for scientific paper recommendation that processes user profiles, daily paper streams, and interest drift through three stages: profiling, recommending, and adapting, using a longitudinal benchmark with 24 users, 50 daily streams, and 1,200… 19

Hugging Face Daily Papers research 24d ago SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents Abstract SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships. Generated by… 33

Hugging Face Daily Papers research 24d ago When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents Abstract ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing… 12

Hugging Face Daily Papers research 24d ago WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world… 11

Hugging Face Daily Papers research 24d ago OpenSkill: Open-World Self-Evolution for LLM Agents Abstract OpenSkill enables self-evolving agents to develop skills and verification signals from scratch using open-world resources without target-task supervision, achieving high automated performance across benchmarks.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Self-evolving… 30

Hugging Face Daily Papers research 24d ago dots.tts Technical Report Abstract A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques. Generated by… 32

r/LocalLLaMA community 25d ago Qwen 3.6 27B on DeepSWE Overview: It scored 2% (1.79% rounded up) It is 18/20th place scoring above Haiku 4.5 and Minimax M2.7 Full benchmark took 70 hours Average time per task 32m Average output tokens per task: 44k Perspectives: It scored suspiciously similar to 3.6 Plus and it really gets me… 21

r/LocalLLaMA community 25d ago Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine due to support of additional types:… 31

r/LocalLLaMA community 25d ago Gemma 4 31B QAT Q4 vs standard Q4 — Top1 KLD benchmark results have me confused. Someone please explain or poke holes in this. Edited - After digging into this some more and reviewing unsloth post for better understanding, the divergence APPEARS to stem from I did not use the BF16 QAT model as the "reference" model.... The QAT vs standard Q4 comparison in our benchmark is not apples-to-apples. The QAT… 11

r/LocalLLaMA community 26d ago AMD MI50 on Debian Testing is doing great and getting better. There is probably some relevant information to other cards here but my benchmarks are on dual MI50 32GB cards because that is what I have, and thought I would share with the community. Install instructions at the end.
I'll put a dump of the full llama-benchy tables in a comment… 21

r/LocalLLaMA community 26d ago 120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the… 17

r/LocalLLaMA community 26d ago KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive! TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher. A number of people in the comments under my previous post asked a fair question: what if we… 21

r/LocalLLaMA community 27d ago Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought. Namely for Honcho workload tiers and differing cron jobs. Not every workload benefits from an… 35

r/LocalLLaMA community 27d ago dots.tts 2B🎙️ SOTA TTS from RedNote 🔗 Blog: https://rednote-hilab.github.io/dots.tts-demo/ 🔗 GitHub: https://github.com/rednote-hilab/dots.tts 🔗 Technical Report: https://arxiv.org/abs/2608.16894 dots.tts 🎙️ New open-source TTS from RedNote (Xiaohongshu) ✨ 2B parameters (Apache 2.0) ✨ Fully continuous… 16

Hugging Face Daily Papers research 27d ago SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models… 38

Hugging Face Daily Papers research 27d ago Benchmark Everything Everywhere All at Once Abstract Automated benchmark creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Benchmarks are fundamental for evaluating and advancing… 27

r/LocalLLaMA community 27d ago I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM… 12

r/MachineLearning community 27d ago Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D] Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in. Setup: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono. Results: Inference path RTF Peak Memory CPU… 26

r/MachineLearning community 27d ago An autonomous research agent was the #1 contributor in OpenAI's Hiring Competition Parameter Golf (by merged records) [R] https://preview.redd.it/kucy7n6nrg5h1.png?width=1600&format=png&auto=webp&s=b1c2e537667fbca3d1736fc103296c7374270d9c An autonomous research agent ended up with more merged leaderboard records than any individual human contributor in OpenAI's spring hiring competition, Parameter… 27

Hugging Face Daily Papers research 27d ago ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment Abstract ForeSci is a temporally controlled benchmark that evaluates LLM agents' ability to make forward-looking research decisions from historical evidence across fast-moving AI domains.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI research often requires decisions before… 4

r/LocalLLaMA community 27d ago Benchmark & Reality Check on Gemma 4 12B: Great model, but your local settings are probably breaking it (Fix inside) I completed a Python bug hunting benchmark with Gemma 4 12B. I used the Unsloth Dynamic Q5 GGUF model. The model has good capabilities. Default settings in LM Studio disable the reasoning. Fix the LM Studio reasoning configuration. LM Studio looks for Qwen tokens. Gemma 4 uses… 30

Hugging Face Daily Papers research 27d ago Towards One-to-Many Temporal Grounding Abstract One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 11

Hugging Face Daily Papers research 27d ago MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding Abstract Mechanical engineering drawing understanding is improved through a specialized dataset and domain-specific model that outperforms existing baselines by leveraging multi-stage training and high-density visual question answering annotations. Generated by… 9

r/MachineLearning community 27d ago Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? [d] Hello everyone, Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? I am working on a project idea related to library-specific code generation. The concrete case is a specific Python library used in a… 18

Smol AI News news-outlet 27d ago not much happened today **Anthropic's Mythos/Opus cycle** sparked mixed reactions with praise for **Claude Mythos**'s one-shot workflows and concerns over **Opus 4.8** benchmark regressions.
**Opus 4.7** showed strong chemistry task performance, "making Claude a chemist." **Sakana AI** launched an… 23

Hugging Face Daily Papers research 27d ago AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Abstract AdaPlanBench presents a dynamic interactive benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Planning for real-world… 18

Hugging Face Daily Papers research 27d ago RobotValues: Evaluating Household Robots When Human Values Conflict Abstract RobotValues benchmark evaluates household robot planners in value-conflict scenarios, revealing that vision-language models exhibit default value preferences and struggle to override them when instructed to prioritize conflicting values. Generated by… 8

arXiv — Machine Learning research 27d ago The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models arXiv:2606.05169v1 Announce Type: new Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by… 30

arXiv — Machine Learning research 27d ago ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models arXiv:2606.05170v1 Announce Type: new Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate.
Hallucination benchmarks report a single error count and treat all… 27

arXiv — Machine Learning research 27d ago Flash-WAM: Modality-Aware Distillation for World Action Models arXiv:2606.05254v1 Announce Type: new Abstract: World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time… 13

arXiv — Machine Learning research 27d ago Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions arXiv:2606.05692v1 Announce Type: new Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on… 35

arXiv — NLP / Computation & Language research 27d ago Generic Triple-Latent Compression with Gated Associative Retrieval arXiv:2606.05175v1 Announce Type: new Abstract: We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves… 6

arXiv — NLP / Computation & Language research 27d ago MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text.
We introduce MCBench, a benchmark with 1196 scenarios spanning four… 5

arXiv — NLP / Computation & Language research 27d ago The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models arXiv:2606.05183v1 Announce Type: new Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial… 20

arXiv — NLP / Computation & Language research 27d ago ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation arXiv:2606.05421v1 Announce Type: new Abstract: When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other,… 6

arXiv — NLP / Computation & Language research 27d ago ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? arXiv:2606.05553v1 Announce Type: new Abstract: Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether… 10

arXiv — NLP / Computation & Language research 27d ago TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not… 32