Tag

Benchmark

500 articles archived under #benchmark

r/LocalLLaMA community 10d ago

Leaderboard for quantized models, similar to artificial analysis?

Artificial Analysis' leaderboard for models is somewhat useful for comparing model intelligence, but it does not take quantization of open models into account. Is there a way to better compare quantized open models against each other and proprietary models other than running them…

35
Hugging Face Daily Papers research 10d ago

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Abstract: The WorldLines benchmark evaluates long-term memory in embodied agents through household scenarios, while the ObsMem framework addresses challenges in partial observability and memory translation for decision-making. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) To assist humans…

19
r/LocalLLaMA community 10d ago

Best local model for vision - 2nd benchmark update - 21 Jun 2026

I previously posted the first results of my VLM benchmark. There were a few useful comments and observations that I took into account to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget, which defaults to 280, essentially making it…

9
r/LocalLLaMA community 11d ago

GLM-5.2 benchmarked on DeepSWE: Beats Gemini & GPT-5.4, but the token volume/cost makes it wildly inefficient? (Theo - t3.gg)

Saw this breakdown from Theo (t3.gg) on X showing the latest DeepSWE leaderboard stats for the new GLM-5.2 open-weight model. The good news: it's officially surpassing GPT-5.4 and the entire Gemini lineup in raw coding capability. Seeing an open-weight model punch that high is…

15
r/LocalLLaMA community 12d ago

Some llama.cpp B70 SYCL benchmarks

build: dd4623a74 (9640)

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 12B Q8_0 | 11.78 GiB | 11.91 B | SYCL | -1 | pp512 | 1578.19 ± 7.82 |
…

11
r/LocalLLaMA community 12d ago

I benchmarked Claude's "Fast C++". It wasn't faster

submitted by /u/User_Deprecated

15
Hugging Face Daily Papers research 13d ago

Context-Aware RL for Agentic and Multimodal LLMs

Abstract: ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by…

21
Hugging Face Daily Papers research 13d ago

The Data Manifold under the Microscope

Abstract: A benchmarking framework is introduced to study data-manifold geometry by extending dSprites and COIL-20 datasets with additional transformation dimensions and dense sampling, enabling accurate estimation of curvature, reach, and volume for theoretical analysis and…

36
r/LocalLLaMA community 13d ago

Benchmarking or benchmarketing?

Maybe I’m getting cynical, but LLM benchmarking is starting to feel less like measurement and more like marketing and positioning. Every week there’s a new leaderboard score, new chart, new eval suite, or some claim that a model is suddenly the best. It feels like benchmarks…

35
r/LocalLLaMA community 13d ago

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts

You can read about it here: https://artificialanalysis.ai/articles/aa-briefcase This is a solid benchmark from Artificial Analysis. It basically tests an LLM's ability to plan and execute tasks. And more importantly, it is a new benchmark that is not saturated, so no one can…

32
Hugging Face Daily Papers research 13d ago

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract: Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct) …

33
r/LocalLLaMA community 13d ago

Has anyone here used VibeThinker-3B outside benchmarks?

Just curious, given the hype and benchmark numbers. Curious about real-world behavior: debugging, coding assistance, reasoning over messy prompts, local latency, failure modes, and whether it actually feels useful versus just optimized for verifiable evals.…

23
Hugging Face Daily Papers research 13d ago

No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

Abstract: Research addresses code generation challenges for no-resource programming languages by developing benchmarks and proposing a method that combines further pre-training with weight difference transfer to create specialized instruction-following models at reduced…

27
r/LocalLLaMA community 13d ago

Researchers trained a Deep Research agent with 32 H100s and open-sourced everything

Ohio State University's NLP team released QUEST-35B, an open-source Deep Research agent trained using ~32 H100s and ~8K synthetic samples. The team open-sourced the training recipe, code, weights and datasets. Benchmark results show competitive performance against several…

13
Hugging Face Daily Papers research 13d ago

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Abstract: Game development frameworks and benchmarks were created using data from game jam competitions to evaluate code generation and project-level programming capabilities. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Current AI-driven game development has made substantial…

25
Hugging Face Daily Papers research 13d ago

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Abstract: A large-scale real-world dataset called DF3DV-1K is introduced to address the lack of clean and cluttered image sets for distractor-free radiance field research, containing 1,048 scenes with 89,924 images across 128 distractor types and 161 scene themes, along with a…

5
arXiv — NLP / Computation & Language research 13d ago

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

arXiv:2606.19558v1 Announce Type: cross Abstract: Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a…

32
arXiv — Machine Learning research 13d ago

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for…

35
arXiv — Machine Learning research 13d ago

MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery

arXiv:2606.19624v1 Announce Type: new Abstract: Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the…

16
arXiv — Machine Learning research 13d ago

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

arXiv:2606.19636v1 Announce Type: new Abstract: Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic…

20
arXiv — Machine Learning research 13d ago

Efficient Neural Network Model Selection for Few-Class Application Datasets

arXiv:2606.19712v1 Announce Type: new Abstract: While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are…

29
arXiv — NLP / Computation & Language research 13d ago

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

arXiv:2606.19352v1 Announce Type: new Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented…

16
arXiv — NLP / Computation & Language research 13d ago

LaViSA: A Language and Vision Structural Ambiguity Benchmark

arXiv:2606.19552v1 Announce Type: new Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving…

22
arXiv — NLP / Computation & Language research 13d ago

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

arXiv:2606.19881v1 Announce Type: new Abstract: Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector…

38
arXiv — NLP / Computation & Language research 13d ago

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for…

34
arXiv — NLP / Computation & Language research 13d ago

Benchmarking Agentic Review Systems

arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems…

15
arXiv — NLP / Computation & Language research 13d ago

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

arXiv:2606.19788v1 Announce Type: cross Abstract: We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies,…

29
arXiv — NLP / Computation & Language research 13d ago

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

arXiv:2606.19830v1 Announce Type: cross Abstract: Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to…

6
arXiv — NLP / Computation & Language research 13d ago

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

arXiv:2507.00875v3 Announce Type: replace Abstract: Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology,…

38
arXiv — NLP / Computation & Language research 13d ago

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and…

22
Hugging Face Daily Papers research 13d ago

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Abstract: FAPO optimizes LLM pipelines by combining prompt editing with structural changes, demonstrating superior performance across multiple benchmarks and security tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multi-step LLM pipelines fail through interactions among…

38
Hugging Face Daily Papers research 13d ago

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Abstract: FreeStyle is a scalable dual-reference generation framework that uses community LoRA mining to create large-scale style-content triplets while addressing content leakage through disentanglement mechanisms and a comprehensive benchmark. Generated by…

16
Hugging Face Daily Papers research 13d ago

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Abstract: Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct) …

27
Hugging Face Daily Papers research 13d ago

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

Abstract: A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems. Generated by…

23
r/LocalLLaMA community 14d ago

Cutting LLM Token Costs with rtk, headroom, and caveman - savings measured on real workloads

rtk, headroom, and caveman keep showing up whenever someone posts about cutting their token bill 60-90%. I wanted to know what they save on an actual bill instead of a benchmark, so I replayed all three over my own Claude Code history. My corpus was 500 of my own Claude Code…

11
r/MachineLearning community 14d ago

Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D]

I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with…

25
r/LocalLLaMA community 14d ago

GLM-5.2 Is The Best Open Weight Creative Writing Model

As per Sam Paech's Creative Writing Benchmark on EQ Bench: https://eqbench.com/creative_writing.html (submitted by /u/Few_Painter_5588)

24
Hugging Face Daily Papers research 14d ago

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Abstract: A 3D point motion forecasting model predicts object trajectories from visual history and language goals, demonstrating superior performance on benchmarks and transferring effectively to robot manipulation and video generation tasks. Generated by…

4
Hugging Face Daily Papers research 14d ago

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Abstract: iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) A useful phone agent needs to be…

6
Hugging Face Daily Papers research 14d ago

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Abstract: MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and…

29
Hugging Face Daily Papers research 14d ago

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

Abstract: A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Predictive code…

17
r/LocalLLaMA community 14d ago

Le Chaton Fat Flash local when?

We are very happy with Le Chaton Fat SOTA but most of us would like to run it locally. You know, for privacy and sovereignty reasons. Does anyone have any updates on when a local "flash" or "small" version will be available? (submitted by /u/corpo_monkey) …

31
arXiv — Machine Learning research 14d ago

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule…

23
arXiv — Machine Learning research 14d ago

Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting

arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on…

19
arXiv — Machine Learning research 14d ago

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under…

7
arXiv — Machine Learning research 14d ago

MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

arXiv:2606.18640v1 Announce Type: new Abstract: Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized…

37
arXiv — NLP / Computation & Language research 14d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory…

22
arXiv — Machine Learning research 14d ago

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

arXiv:2606.18970v1 Announce Type: new Abstract: Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However,…

38
arXiv — Machine Learning research 14d ago

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

arXiv:2606.19036v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection…

16
arXiv — NLP / Computation & Language research 14d ago

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the…

19
