Tag

Benchmark

500 articles archived under #benchmark · RSS

r/LocalLLaMA community 8d ago

OpenAI and Broadcom unveil LLM-optimized inference chip

https://openai.com/index/openai-broadcom-jalapeno-inference-chip/ Quoted from the start of the blog post: Early testing shows that the first-generation accelerator will deliver performance per watt substantially better than current state-of-the-art Built from the ground up for…

11
r/LocalLLaMA community 8d ago

Qwen-AgentWorld-35B-A3B for Coding?

Benchmark from its model card. Removed online models & Qwen-AgentWorld-397B-A17B from the table. Just open models.

Model | MCP | Search | Term. | SWE | Android | Web | OS | Overall
DeepSeek-V4-Pro | 63.27 | 27.61 | 51.26 | 59.44 | 55.17 | 50.32 | 63.70 | 52.97
GLM-5.1 | 67.60 | 22.46 | 47.32 | 52.07 | 59.10 | 51.50 | 59.13 | …

11
Hugging Face Daily Papers research 8d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Abstract Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language…

26
r/MachineLearning community 8d ago

I compiled LLM inference pricing across 7 providers — the caching numbers are surprising(spreadsheet included) [R]

I've been comparing GPU/LLM providers for a side project and ended up with way too many browser tabs and spreadsheets. So I decided to pull the public pricing data into one sheet and compare it side by side. A quick disclaimer: this is not benchmark data. I didn't run latency…

32
Hugging Face Daily Papers research 8d ago

ChartWalker: Benchmarking the Cross-Chart RAG Task

Abstract ChartWalker presents a novel framework for cross-chart retrieval-augmented generation with hierarchical knowledge graph construction and structure-aware sampling for challenging multi-modal analytical tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Cross-Chart…

33
Hugging Face Daily Papers research 8d ago

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Abstract Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers that demonstrates the need for comprehensive benchmarking beyond ImageNet class-conditional generation to assess true progress in generative modeling. Generated by…

25
Hugging Face Daily Papers research 8d ago

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Mental…

36
arXiv — Machine Learning research 8d ago

You Don't Need to Run Every Eval

arXiv:2606.24020v1 Announce Type: new Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to…

29
arXiv — Machine Learning research 8d ago

RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting

arXiv:2606.24062v1 Announce Type: new Abstract: Financial time series forecasting presents structural challenges absent from standard benchmarks. Log-returns are non-stationary, exhibit exceptionally low signal-to-noise (SNR) ratios, and are governed by regime-dependent temporal…

8
arXiv — Machine Learning research 8d ago

Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

arXiv:2606.24173v1 Announce Type: new Abstract: On-device fault detection enables real-time diagnostics without cloud dependency, but deploying machine learning models on resource-constrained hardware demands careful tradeoffs between accuracy, latency, and model size. We…

14
arXiv — NLP / Computation & Language research 8d ago

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

arXiv:2606.24162v1 Announce Type: new Abstract: Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject…

27
arXiv — Machine Learning research 8d ago

Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints

arXiv:2606.24353v1 Announce Type: cross Abstract: Bird's-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them…

6
arXiv — Machine Learning research 8d ago

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

arXiv:2606.24388v1 Announce Type: cross Abstract: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by…

38
arXiv — NLP / Computation & Language research 8d ago

QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

arXiv:2606.23943v1 Announce Type: new Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark…

32
arXiv — NLP / Computation & Language research 8d ago

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

arXiv:2606.23992v1 Announce Type: new Abstract: Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark…

32
arXiv — NLP / Computation & Language research 8d ago

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,…

38
arXiv — NLP / Computation & Language research 8d ago

A Pāninian Foundation for Indic Language Processing

arXiv:2606.24172v1 Announce Type: new Abstract: More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks…

24
arXiv — NLP / Computation & Language research 8d ago

A Synthetic Reliability-Aware PINN Benchmark for Offshore Wind Turbine Support-Structure Monitoring with Bayesian Inverse Identification

arXiv:2606.24176v1 Announce Type: new Abstract: Reliable structural health monitoring (SHM) of offshore wind turbine (OWT) support structures requires fast state estimation from sparse measurements. Repeated high fidelity finite element or aeroelastic analyses are difficult to…

8
arXiv — NLP / Computation & Language research 8d ago

MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

arXiv:2606.24200v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) in clinical settings increasingly requires multilingual retrieval against predominantly English evidence corpora. Multilingual medical retrieval demands three capabilities: cross-lingual…

36
arXiv — NLP / Computation & Language research 8d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of…

36
arXiv — NLP / Computation & Language research 8d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

arXiv:2606.24530v1 Announce Type: new Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real…

21
arXiv — NLP / Computation & Language research 8d ago

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

arXiv:2606.24627v1 Announce Type: new Abstract: Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect…

4
arXiv — NLP / Computation & Language research 8d ago

CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

arXiv:2606.24714v1 Announce Type: new Abstract: Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening…

33
arXiv — NLP / Computation & Language research 8d ago

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv:2606.24391v1 Announce Type: cross Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept…

29
arXiv — NLP / Computation & Language research 8d ago

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained…

15
arXiv — NLP / Computation & Language research 8d ago

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially,…

20
arXiv — NLP / Computation & Language research 8d ago

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

arXiv:2501.11790v5 Announce Type: replace Abstract: Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable…

29
arXiv — NLP / Computation & Language research 8d ago

Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

arXiv:2505.18542v4 Announce Type: replace Abstract: Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional…

32
Hugging Face Daily Papers research 8d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation…

21
Hugging Face Daily Papers research 8d ago

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Abstract Text-to-image models fail to generate counterfactual scenes because they rely on tightly coupled visual-textual patterns rather than causal reasoning, demonstrating limited understanding beyond pattern matching. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image…

26
r/MachineLearning community 8d ago

DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks: Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining. High diversity: Tasks span a broad pool of 91 repositories across 5…

9
Hugging Face official-blog 8d ago

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Published June 24, 2026 Daniel Gert Nielsen daniel-treble treble-technologies Shivam Saini whojavumusic treble-technologies Alessia Milo alessia-treble…

11
Vercel — AI dev-tools 8d ago

GLM 5.2 Fast via Wafer now available on AI Gateway

GLM 5.2 Fast via Wafer is now available on AI Gateway. Based on our own benchmarking across small-context, large-context, and tool-call scenarios, Wafer delivers 2x higher throughput than other providers serving GLM-5.2 on serverless, leading on decode and end-to-end speed…

7
r/LocalLLaMA community 8d ago

OpenMythos benchmarks

Hey everyone! OpenMythos benchmarks are finally here sorry it took about a week to post these. The delay was mainly because SWE-bench results weren't matching up with Qwen 3.6 27B official numbers. Turns out Qwen used a different eval harness and also refined/filtered the…

12
r/LocalLLaMA community 8d ago

I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.

I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8…

10
Hacker News — AI on Front Page community 8d ago

Krea 2: SOTA open-weights 12B image model

Article URL: https://www.krea.ai/blog/krea-2-technical-report Comments URL: https://news.ycombinator.com/item?id=48646659 Points: 247 # Comments: 33

4
Hugging Face Daily Papers research 8d ago

Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Abstract Research examines how self-driving car systems and humans perform on visual question answering tasks across different geographic locations, revealing that both human and AI responses diverge based on question types but show similar performance regardless of location.…

5
r/LocalLLaMA community 9d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't…

19
r/LocalLLaMA community 9d ago

Human Evaluation of GLM-5.2

I've seen plenty of benchmarks that put GLM-5.2 below many of the closed source alternatives but at their heels. I thought to myself, next version GLM will totally be where the best frontiers are at now. The last few days I've been testing it on a real world project, and it's…

6
Hugging Face Daily Papers research 9d ago

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Abstract HAKARI-Bench provides a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct With the rapid spread of…

23
Hugging Face Daily Papers research 9d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured…

19
Hugging Face Daily Papers research 9d ago

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents…

30
Hugging Face Daily Papers research 9d ago

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to…

14
Hugging Face Daily Papers research 9d ago

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Abstract PhySciBench benchmark reveals limited performance of current LLM agents in physical science research, leading to development of DelveAgent framework that improves accuracy through modular design and physics-grounded mechanisms. Generated by…

5
r/MachineLearning community 9d ago

Non-deterministic Vulnerability Detection Benchmark System [P]

I work in firmware adjacent to AI, so not an ML guy exactly, so that's why I've come here. For work we got a bit concerned about Mythos and all the hype made me explore some benchmarking work. I now have this pretty cool benchmark that's about 80% done sitting around and haven't…

26
r/MachineLearning community 9d ago

Syntactically robust NLI for semantics of imperfectly generated text? [R]

Hi all, I'm looking for literature on relatively specific tooling. In autoregressive LLMs, there is substantial published work that used NLI on sub-claims produced by LLMs to gauge correctness of LLM answers. In diffusion (or D-) LLMs, the SoTA model generations that I see…

37
r/LocalLLaMA community 9d ago

NEX-N2-mini: "There is no Pareto frontier. I am Pareto". This Qwen3.5-MoE fine tune fixed 3.5 and 3.6 overthinking apparently on my tests.

I have been testing all popular MoE for my Mac and it seems I just found gold: 3.5/3.6 level of reasoning (if not slightly superior) at a fraction of the reasoning tokens used (wasted). Dynamic plot with other benchmarks here: https://benchmark-yourself.streamlit.app/…

4
r/LocalLLaMA community 10d ago

Gemma 4 QAT 31B responds better to KV cache quantization too

I've run the benchmark from this post and got even better results on Gemma 4 31B. Submitted by /u/justicecurcian

29
Hugging Face Daily Papers research 10d ago

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

Abstract SpatialAvatar-0 enables high-quality 4D head avatar generation by combining feed-forward prediction with per-subject refinement through a shared Gaussian representation, achieving superior performance across multiple benchmarks. Generated by…

20
Hugging Face Daily Papers research 10d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract Current memory agents lack reliable shared institutional deployment due to challenges in balancing utility, access control, and forgetting across multiple principals with diverse authorization contexts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory benchmarks for…

5
