Tag: Benchmark · 500 articles archived under #benchmark
Hugging Face Daily Papers · research · 1d ago · OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks · Abstract: OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Existing computer-use benchmarks… · 24
r/MachineLearning · community · 1d ago · REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R] · Submitted by /u/julian88888888 · 13
Hugging Face · official-blog · 1d ago · ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration · Enterprise Article · Published June 30, 2026 · Raju Pavuluri (rpavuluri, ibm-research), Rahul Krishna (rkrsn, ibm-research), Srikanth Govindaraj Tamilselvam (stamilse, ibm-research)… · 13
Hugging Face Daily Papers · research · 1d ago · SWE-Together: Evaluating Coding Agents in Interactive User Sessions · Abstract: SWE-Together is a multi-turn coding benchmark created from real user-agent interactions, featuring a reactive LLM simulator to evaluate agents based on both final correctness and interaction efficiency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Most coding-agent… · 32
r/LocalLLaMA · community · 1d ago · Benchmarked Graph-RAG vs. Graph-Free Multi-Hop RAG: The graph mostly bought us a massive rebuild bill, not accuracy. · We kept hitting the same wall building multi-hop RAG: the systems with the best accuracy (GraphRAG, HippoRAG 2, RAPTOR) all lean on a knowledge graph built offline - and that’s great numbers, until the moment your data changes!
Every update means re-running an LLM indexing pass… · 11
r/LocalLLaMA · community · 1d ago · I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy · Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one. So I finally benchmarked it properly. Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run with the full… · 23
r/LocalLLaMA · community · 1d ago · Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090 · First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes... These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox)… · 12
r/LocalLLaMA · community · 1d ago · Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset · Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream.
You can find it via the difference of means between… · 4
Hugging Face Daily Papers · research · 1d ago · ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval · Abstract: A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.… · 32
Hugging Face Daily Papers · research · 1d ago · Beyond Drug Discovery: The Nanotechnology Molecular Optimization (NMO) Benchmark · Abstract: The Nanotechnology Molecular Optimization (NMO) Benchmark introduces physics-based molecular design challenges that require new generative model approaches, moving beyond drug-discovery-focused metrics to enable scientific discovery in nanotechnology. Generated by… · 24
r/LocalLLaMA · community · 2d ago · Tesla V100 16GB local LLMs, single and dual NVLink benchmarks · Picked up a couple of Tesla V100-SXM2-16GB modules a while back to run local models and drive Claude Code fully offline, figured the actual numbers and the traps might save someone else the pain. They've come right down in price and the 16GB of HBM2 at ~900 GB/s still holds up… · 33
r/LocalLLaMA · community · 2d ago · InternScience/Agents-A1 · Hugging Face · Unbelievable benchmarks for a 35B MoE, somebody verify. Here is tech report btw: https://arxiv.org/pdf/2606.30616 · Submitted by /u/mlon_eusk-_- · 23
Hugging Face Daily Papers · research · 2d ago · TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents · Abstract: TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents.
Generated by… · 4
Hugging Face Daily Papers · research · 2d ago · Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning · Abstract: A new benchmark evaluates multimodal large language models' ability to reason over dynamic visual evidence through controlled temporal-logical operations rather than simple object recognition. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Recent interest in multimodal… · 25
Hugging Face Daily Papers · research · 2d ago · Trimming the Long-Tail of Visual World Modeling Evaluation · Abstract: Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Physical interactions follow a… · 28
arXiv — Machine Learning · research · 2d ago · Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models · arXiv:2606.28406v1 · Announce Type: new · Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing… · 36
arXiv — Machine Learning · research · 2d ago · Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy · arXiv:2606.28433v1 · Announce Type: new · Abstract: One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings.
When running experiments, however, the goal of… · 5
arXiv — Machine Learning · research · 2d ago · Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation · arXiv:2606.28925v1 · Announce Type: new · Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived… · 16
arXiv — Machine Learning · research · 2d ago · Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models · arXiv:2606.29196v1 · Announce Type: new · Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using… · 27
arXiv — Machine Learning · research · 2d ago · KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory · arXiv:2606.29243v1 · Announce Type: new · Abstract: We present KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset for crop advisory in low-resource settings. We establish a foundation of 290 hierarchical Knowledge Nodes, extracting disease… · 30
arXiv — NLP / Computation & Language · research · 2d ago · Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks · arXiv:2606.29082v1 · Announce Type: new · Abstract: Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on… · 4
arXiv — NLP / Computation & Language · research · 2d ago · Can OCR-VLMs Read Devanagari?
A Stress-Test Benchmark and Post-Correction Study · arXiv:2606.29213v1 · Announce Type: new · Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts… · 30
arXiv — NLP / Computation & Language · research · 2d ago · mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health · arXiv:2606.29467v1 · Announce Type: new · Abstract: Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health… · 25
arXiv — NLP / Computation & Language · research · 2d ago · Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs · arXiv:2606.29534v1 · Announce Type: new · Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a… · 23
arXiv — NLP / Computation & Language · research · 2d ago · Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models · arXiv:2606.29689v1 · Announce Type: new · Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against… · 8
arXiv — NLP / Computation & Language · research · 2d ago · How Far Can You Get Without a GPU?
A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation · arXiv:2606.29809v1 · Announce Type: new · Abstract: Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating… · 27
arXiv — NLP / Computation & Language · research · 2d ago · SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models · arXiv:2606.29815v1 · Announce Type: new · Abstract: Evaluating code large language models (Code LLMs) requires reliable detection of data leakage, where benchmark performance is artificially inflated by exposure to benchmark data during pre-training. Existing approaches either… · 7
arXiv — NLP / Computation & Language · research · 2d ago · Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency · arXiv:2606.29876v1 · Announce Type: new · Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical… · 10
arXiv — NLP / Computation & Language · research · 2d ago · Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization · arXiv:2606.29933v1 · Announce Type: new · Abstract: The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and… · 16
Hugging Face Daily Papers · research · 2d ago · SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing · Abstract: SafePyramid benchmark evaluates guardrail systems' ability to identify safety violations through in-context policy specification across multiple domains and complexity levels.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct. In real-world applications, guardrails are often… · 5
Hugging Face Daily Papers · research · 2d ago · Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction · Abstract: A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks. Generated by… · 28
r/LocalLLaMA · community · 2d ago · Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought · Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.… · 19
OpenAI · official-blog · 2d ago · Introducing GeneBench-Pro · Introducing GeneBench-Pro, a new benchmark testing AI performance in genomics, biology, and scientific research using complex, real-world datasets. · 22
OpenAI · official-blog · 2d ago · Inside Genebench-Pro · June 30, 2026 · A closer look at the benchmark, its questions, and supporting materials. Case studies: These 10 case studies showcase representative questions from GeneBench-Pro. Each case study includes the original prompt, datasets, and supporting materials.… · 35
TechCrunch — AI · news-outlet · 2d ago · Arena, the AI leaderboard everyone uses, is now a $100M business · The startup, which runs a popular free AI leaderboard, launched its commercial service just last September. · 23
r/MachineLearning · community · 2d ago · Adaptive Mixture of Experts Gate (AMG) [R] · [Project] Post-hoc Adaptive MoE Gating on Qwen3.6-35B — empirical benchmarking of an open research gap. Adaptive MoE routing — selecting a variable number of experts per token based on routing confidence — has been studied in papers (XMoE 2024, DynMoE ICLR 2025, TopP routing… · 5
arXiv — Machine Learning · research · 3d ago · Learning in Markovian bandits with non-observable states and constrained decision epochs · arXiv:2606.27448v1 · Announce Type: new · Abstract: This paper studies the problem of regret minimization in Markovian bandits with non-observable states and possibly constrained decision epochs. The focus is restricted to a "pure" regret benchmark, that compares the… · 26
arXiv — Machine Learning · research · 3d ago · Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings · arXiv:2606.27997v1 · Announce Type: new · Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets… · 21
arXiv — NLP / Computation & Language · research · 3d ago · Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs · arXiv:2606.27378v1 · Announce Type: new · Abstract: We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks.… · 29
arXiv — NLP / Computation & Language · research · 3d ago · Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval · arXiv:2606.27401v1 · Announce Type: cross · Abstract: Semantic code search and clone detection are essential for software development, maintenance, and reuse.
This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage… · 35
arXiv — Machine Learning · research · 3d ago · Benchmarking Multi-Modal Graph-based Social Media Popularity Prediction · arXiv:2606.27539v1 · Announce Type: cross · Abstract: Social media popularity prediction aims to forecast the future reach or influence of online content from early-stage observations. Accurate prediction enables key downstream applications, such as advertising optimization and… · 25
arXiv — NLP / Computation & Language · research · 3d ago · Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents · arXiv:2606.27595v1 · Announce Type: new · Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,… · 32
arXiv — NLP / Computation & Language · research · 3d ago · When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search · arXiv:2606.27669v1 · Announce Type: new · Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume… · 27
arXiv — NLP / Computation & Language · research · 3d ago · CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models · arXiv:2606.27383v1 · Announce Type: cross · Abstract: Large language models (LLMs) are increasingly used as research assistants, yet it remains unclear whether they can calibrate research takeaways to the strength and scope of the supporting evidence.
We study evidence-calibrated… · 17
arXiv — NLP / Computation & Language · research · 3d ago · DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection · arXiv:2606.27499v1 · Announce Type: cross · Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write… · 11
arXiv — NLP / Computation & Language · research · 3d ago · LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks · arXiv:2604.13072v2 · Announce Type: replace · Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be… · 28
Hacker News — AI on Front Page · community · 3d ago · GLM 5.2 beats Claude in our benchmarks · Article URL: https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/ · Comments URL: https://news.ycombinator.com/item?id=48709670 · Points: 273 · # Comments: 109 · 22
r/LocalLLaMA · community · 3d ago · Are there good closed vs open LLM rankings? Also, are 70B–350B models actually worth it? · hey, I’m currently getting enough VRAM to run something in the GLM-5.2 range, but I’m wondering: do we actually have a solid ranking that compares closed-source and open-weight LLMs side by side? I’ve been trying to find a clear “closed vs open” leaderboard, but most benchmarks… · 26
r/LocalLLaMA · community · 4d ago · Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? · After spending countless hours testing on 3 "potato" laptops (Intel i3, 8GB RAM, Win11, integrated GPU), that's my conclusion. For reliably extracting data from images to JSON on low-end hardware, nothing else even comes close.
Yet, it’s completely missing from major benchmarks… · 23
r/LocalLLaMA · community · 4d ago · US Ban Benchmark Updated: Toe-to-toe Between Two Big Names! · OpenAI ties with Anthropic in this benchmark following the preview of GPT 5.6 just yesterday. Chinese models have no hope of catching up forever, while Gemini's figure is yet to be updated. · Submitted by /u/Complete-Sea6655 · 30