Tag

Benchmark

500 articles archived under #benchmark · RSS

Hugging Face Daily Papers research 1d ago

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Abstract: OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Existing computer-use benchmarks…

24
r/MachineLearning community 1d ago

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]

submitted by /u/julian88888888

13
Hugging Face official-blog 1d ago

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

Enterprise article, published June 30, 2026. Raju Pavuluri (rpavuluri, ibm-research), Rahul Krishna (rkrsn, ibm-research), Srikanth Govindaraj Tamilselvam (stamilse, ibm-research)…

13
Hugging Face Daily Papers research 1d ago

SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Abstract: SWE-Together is a multi-turn coding benchmark created from real user-agent interactions, featuring a reactive LLM simulator to evaluate agents based on both final correctness and interaction efficiency. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Most coding-agent…

32
r/LocalLLaMA community 1d ago

Benchmarked Graph-RAG vs. Graph-Free Multi-Hop RAG: The graph mostly bought us a massive rebuild bill, not accuracy.

We kept hitting the same wall building multi-hop RAG: the systems with the best accuracy (GraphRAG, HippoRAG 2, RAPTOR) all lean on a knowledge graph built offline, and the numbers are great until the moment your data changes! Every update means re-running an LLM indexing pass…

11
r/LocalLLaMA community 1d ago

I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy

Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one. So I finally benchmarked it properly. Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run with the full…

23
r/LocalLLaMA community 1d ago

Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090

First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes... These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox)…

12
r/LocalLLaMA community 1d ago

Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset

Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between…

4
Hugging Face Daily Papers research 1d ago

ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

Abstract: A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.…

32
Hugging Face Daily Papers research 1d ago

Beyond Drug Discovery: The Nanotechnology Molecular Optimization (NMO) Benchmark

Abstract: The Nanotechnology Molecular Optimization (NMO) Benchmark introduces physics-based molecular design challenges that require new generative model approaches, moving beyond drug-discovery-focused metrics to enable scientific discovery in nanotechnology. Generated by…

24
r/LocalLLaMA community 2d ago

Tesla V100 16GB local LLMs, single and dual NVLink benchmarks

Picked up a couple of Tesla V100-SXM2-16GB modules a while back to run local models and drive Claude Code fully offline, figured the actual numbers and the traps might save someone else the pain. They've come right down in price and the 16GB of HBM2 at ~900 GB/s still holds up…

33
r/LocalLLaMA community 2d ago

InternScience/Agents-A1 · Hugging Face

Unbelievable benchmarks for a 35B MoE, somebody verify. Here is tech report btw: https://arxiv.org/pdf/2606.30616 (submitted by /u/mlon_eusk-_-)

23
Hugging Face Daily Papers research 2d ago

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Abstract: TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents. Generated by…

4
Hugging Face Daily Papers research 2d ago

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Abstract: A new benchmark evaluates multimodal large language models' ability to reason over dynamic visual evidence through controlled temporal-logical operations rather than simple object recognition. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Recent interest in multimodal…

25
Hugging Face Daily Papers research 2d ago

Trimming the Long-Tail of Visual World Modeling Evaluation

Abstract: Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Physical interactions follow a…

28
arXiv — Machine Learning research 2d ago

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing…

36
arXiv — Machine Learning research 2d ago

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

arXiv:2606.28433v1 Announce Type: new Abstract: One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings. When running experiments, however, the goal of…

5
arXiv — Machine Learning research 2d ago

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived…

16
arXiv — Machine Learning research 2d ago

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using…

27
arXiv — Machine Learning research 2d ago

KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory

arXiv:2606.29243v1 Announce Type: new Abstract: We present KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset for crop advisory in low-resource settings. We establish a foundation of 290 hierarchical Knowledge Nodes, extracting disease…

30
arXiv — NLP / Computation & Language research 2d ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

arXiv:2606.29082v1 Announce Type: new Abstract: Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on…

4
arXiv — NLP / Computation & Language research 2d ago

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

arXiv:2606.29213v1 Announce Type: new Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts…

30
arXiv — NLP / Computation & Language research 2d ago

mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health

arXiv:2606.29467v1 Announce Type: new Abstract: Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health…

25
arXiv — NLP / Computation & Language research 2d ago

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

arXiv:2606.29534v1 Announce Type: new Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a…

23
arXiv — NLP / Computation & Language research 2d ago

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against…

8
arXiv — NLP / Computation & Language research 2d ago

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

arXiv:2606.29809v1 Announce Type: new Abstract: Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating…

27
arXiv — NLP / Computation & Language research 2d ago

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

arXiv:2606.29815v1 Announce Type: new Abstract: Evaluating code large language models (Code LLMs) requires reliable detection of data leakage, where benchmark performance is artificially inflated by exposure to benchmark data during pre-training. Existing approaches either…

7
arXiv — NLP / Computation & Language research 2d ago

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical…

10
arXiv — NLP / Computation & Language research 2d ago

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

arXiv:2606.29933v1 Announce Type: new Abstract: The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and…

16
Hugging Face Daily Papers research 2d ago

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

Abstract: The SafePyramid benchmark evaluates guardrail systems' ability to identify safety violations through in-context policy specification across multiple domains and complexity levels. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) In real-world applications, guardrails are often…

5
Hugging Face Daily Papers research 2d ago

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Abstract: A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks. Generated by…

28
r/LocalLLaMA community 2d ago

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…

19
OpenAI official-blog 2d ago

Introducing GeneBench-Pro

Introducing GeneBench-Pro, a new benchmark testing AI performance in genomics, biology, and scientific research using complex, real-world datasets.

22
OpenAI official-blog 2d ago

Inside Genebench-Pro

June 30, 2026. A closer look at the benchmark, its questions, and supporting materials. Case studies: These 10 case studies showcase representative questions from GeneBench-Pro. Each case study includes the original prompt, datasets, and supporting materials.…

35
TechCrunch — AI news-outlet 2d ago

Arena, the AI leaderboard everyone uses, is now a $100M business

The startup, which runs a popular free AI leaderboard, launched its commercial service just last September.

23
r/MachineLearning community 2d ago

Adaptive Mixture of Experts Gate (AMG) [R]

[Project] Post-hoc Adaptive MoE Gating on Qwen3.6-35B — empirical benchmarking of an open research gap Adaptive MoE routing — selecting a variable number of experts per token based on routing confidence — has been studied in papers (XMoE 2024, DynMoE ICLR 2025, TopP routing…

5
arXiv — Machine Learning research 3d ago

Learning in Markovian bandits with non-observable states and constrained decision epochs

arXiv:2606.27448v1 Announce Type: new Abstract: This paper studies the problem of regret minimization in Markovian bandits with non-observable states and possibly constrained decision epochs. The focus is restricted to a "pure" regret benchmark, that compares the…

26
arXiv — Machine Learning research 3d ago

Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets…

21
arXiv — NLP / Computation & Language research 3d ago

Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

arXiv:2606.27378v1 Announce Type: new Abstract: We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks.…

29
arXiv — NLP / Computation & Language research 3d ago

Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval

arXiv:2606.27401v1 Announce Type: cross Abstract: Semantic code search and clone detection are essential for software development, maintenance, and reuse. This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage…

35
arXiv — Machine Learning research 3d ago

Benchmarking Multi-Modal Graph-based Social Media Popularity Prediction

arXiv:2606.27539v1 Announce Type: cross Abstract: Social media popularity prediction aims to forecast the future reach or influence of online content from early-stage observations. Accurate prediction enables key downstream applications, such as advertising optimization and…

25
arXiv — NLP / Computation & Language research 3d ago

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

arXiv:2606.27595v1 Announce Type: new Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,…

32
arXiv — NLP / Computation & Language research 3d ago

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume…

27
arXiv — NLP / Computation & Language research 3d ago

CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models

arXiv:2606.27383v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used as research assistants, yet it remains unclear whether they can calibrate research takeaways to the strength and scope of the supporting evidence. We study evidence-calibrated…

17
arXiv — NLP / Computation & Language research 3d ago

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write…

11
arXiv — NLP / Computation & Language research 3d ago

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

arXiv:2604.13072v2 Announce Type: replace Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be…

28
Hacker News — AI on Front Page community 3d ago

GLM 5.2 beats Claude in our benchmarks

Article URL: https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/ Comments URL: https://news.ycombinator.com/item?id=48709670 Points: 273 # Comments: 109

22
r/LocalLLaMA community 3d ago

Are there good closed vs open LLM rankings? Also, are 70B–350B models actually worth it?

hey, I’m currently getting enough VRAM to run something in the GLM-5.2 range, but I’m wondering: do we actually have a solid ranking that compares closed-source and open-weight LLMs side by side? I’ve been trying to find a clear “closed vs open” leaderboard, but most benchmarks…

26
r/LocalLLaMA community 4d ago

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"?

After spending countless hours testing on 3 "potato" laptops (Intel i3, 8GB RAM, Win11, integrated GPU), that's my conclusion. For reliably extracting data from images to JSON on low-end hardware, nothing else even comes close. Yet, it’s completely missing from major benchmarks…

23
r/LocalLLaMA community 4d ago

US Ban Benchmark Updated: Toe-to-toe Between Two Big Names!

OpenAI ties with Anthropic in this benchmark following the preview of GPT 5.6 just yesterday. Chinese models have no hope of catching up forever, while Gemini's figure is yet to be updated. (submitted by /u/Complete-Sea6655)

30
