Tag

Benchmark

500 articles archived under #benchmark · RSS

r/LocalLLaMA community 1h ago

SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.

I’m pretty jaded like most of y’all. I don’t really get excited by new models much anymore. Last few weeks have been kinda meh to be honest. Monday, I stumbled upon SenseNova’s Mixture of Transformers models and they seem kinda like a different animal than other typical image…

4
r/LocalLLaMA community 2h ago

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

I came across this interesting article https://blog.exolabs.net/nvidia-dgx-spark/ while I don't have the DGX spark but it made me curious will this kind of arch speed up my setup for LLMs? Mac can host large models but the prefill speed sucks, so I tested in it on my setup for…

25
arXiv — NLP / Computation & Language research 2h ago

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

arXiv:2607.00276v1 Announce Type: cross Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning…

10
arXiv — Machine Learning research 2h ago

Timesynth: A Temporal Fidelity Framework for Health Signal Digital Twins

arXiv:2607.00431v1 Announce Type: new Abstract: Forecasting models for health-signal digital twins must preserve the oscillatory, frequency, phase, and state-transition dynamics of physiological signals, yet the pointwise metrics used to benchmark them cannot detect when these…

7
arXiv — NLP / Computation & Language research 2h ago

MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules

arXiv:2607.00464v1 Announce Type: cross Abstract: Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many…

22
arXiv — Machine Learning research 2h ago

Interpretable vs Learned Encoders for High-Cardinality Fraud Detection

arXiv:2607.00477v1 Announce Type: new Abstract: A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation…

7
arXiv — Machine Learning research 2h ago

Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling

arXiv:2607.01022v1 Announce Type: new Abstract: Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models,…

14
arXiv — NLP / Computation & Language research 2h ago

Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth

arXiv:2607.00139v1 Announce Type: new Abstract: The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only…

20
arXiv — NLP / Computation & Language research 2h ago

Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting

arXiv:2607.00159v1 Announce Type: new Abstract: Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence. In practice, answer accuracy is…

30
arXiv — NLP / Computation & Language research 2h ago

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

arXiv:2607.00171v1 Announce Type: new Abstract: Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static, cover only a limited set of languages, are often domain-specific, susceptible to…

4
arXiv — NLP / Computation & Language research 2h ago

LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR

arXiv:2607.00250v1 Announce Type: new Abstract: Maltese has decent text corpora and pretrained language models, but, like many languages outside the handful with large OCR benchmarks, only a single known real labelled PDF corpus for OCR training, 57 page, far below what…

25
arXiv — NLP / Computation & Language research 2h ago

YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

arXiv:2607.00664v1 Announce Type: new Abstract: We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it…

8
arXiv — NLP / Computation & Language research 2h ago

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

arXiv:2607.00724v1 Announce Type: new Abstract: Multilingual fluency often invites a stronger assumption: a model that can speak a user's language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption…

8
arXiv — NLP / Computation & Language research 2h ago

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-asa-Judge approaches. Whether such evaluators replicate clinical calibration…

12
arXiv — NLP / Computation & Language research 2h ago

AGC-Bench: Measuring Artificial General Creativity

arXiv:2607.01152v1 Announce Type: new Abstract: Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of…

10
arXiv — NLP / Computation & Language research 2h ago

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded…

14
arXiv — NLP / Computation & Language research 2h ago

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed.…

21
r/LocalLLaMA community 5h ago

I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3

Just wanted to share that I was looking for an optimal way to run Ornith 35B in FP8 with E4M3 and MTP with vLLM but there was no out-of-the-box model with MTP drafter support. So I grafted this new model! It's 18% faster than without MTP and the drafter acceptance rate is not…

31
r/MachineLearning community 5h ago

Making Optimization Work When Labels Are Scarce [R]

https://www.gnosyslabs.com/case-studies/safety-classifier-sparse-labels Gnosys is an autonomous model engineer: it improves prompts and classifiers when ground truth is too sparse for conventional optimization. On ToxicChat, a public safety benchmark, under realistic label…

23
r/LocalLLaMA community 8h ago

Senior SWE Bench: a new benchmark focussed on realistically underspecified feature tasks

  submitted by   /u/jordo45 [link]   [comments]

37
r/LocalLLaMA community 8h ago

My reasons to run local models

I can finetune any model on any dataset I want. I can use techniques like speculative decoding and other sota approaches to get the max tps The llm provides like anthropic and openai are not getting access to my data The hardware is reusable for vision text speech, and I can run…

10
r/LocalLLaMA community 11h ago

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

Some backstory I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.…

16
r/LocalLLaMA community 12h ago

Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM ? Any tests or benchmarks ?

Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM ? Any tests or benchmarks ? Wondering which one would be better at speed / coding / reasoning   submitted by   /u/soyalemujica [link]   [comments]

32
Hugging Face Daily Papers research 12h ago

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

Abstract SWE-Interact presents a testbed that evaluates coding agents in realistic multi-turn, user-driven software engineering scenarios, revealing significant gaps between single-turn performance and interactive task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

6
r/LocalLLaMA community 14h ago

The gap between closed and open models might be much smaller than commonly assumed, because we don’t know what closed model providers do *in addition to* model inference

When Claude dominates GLM-5.2 in benchmarks, it’s usually assumed that Anthropic has superior model architectures, superior training pipelines, and other advanced machine learning techniques that make their models better than the competition. But actually, this doesn’t follow.…

10
r/LocalLLaMA community 15h ago

SWE-rebench leaderboard update: GLM-5.2, Qwen3.6-27B, Qwen3.6-35B-A3B, Gemma 4 31B and more + improved UI

Hi all, We made several updates to the SWE-rebench leaderboard: added new models, refreshed recent results, and reworked the leaderboard UI to make results easier to read, compare, and understand. New Models: Claude Opus 4.8 xhigh: 56.5% — 2.48M tokens GLM-5.2: 51.1% — 2.62M…

16
Hugging Face Daily Papers research 16h ago

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Abstract A large-scale video editing dataset and model are introduced that support multi-task and structural manipulations through advanced data synthesis and network architectures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing instruction-based video editing datasets…

38
Hugging Face Daily Papers research 21h ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Abstract Multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech-capable models are increasingly deployed in real-world applications across…

36
Hugging Face Daily Papers research 21h ago

Xiaomi-GUI-0 Technical Report

Abstract A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface (GUI) agents build on…

7
arXiv — Machine Learning research 1d ago

Accelerometry-Derived Digital Biomarkers for Cardiometabolic Risk: A Population-Representative Tabular Benchmark with Uncertainty Quantification

arXiv:2606.30702v1 Announce Type: new Abstract: Structured tabular data dominates clinical medicine, yet existing benchmarks fail to reflect real-world properties like complex survey sampling, demographic oversampling, and subgroup fairness. We introduce the NHANES Accelerometry…

31
arXiv — Machine Learning research 1d ago

PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely…

25
arXiv — Machine Learning research 1d ago

Probing Memorization of Tabular In-Context Learning

arXiv:2606.31208v1 Announce Type: new Abstract: Large tabular models (LTMs), i.e., tabular foundation models leveraging in-context learning (ICL), achieve state-of-the-art performance on tabular tasks. While LLMs are known to unintentionally memorize training data, the…

19
arXiv — NLP / Computation & Language research 1d ago

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

arXiv:2606.30790v1 Announce Type: new Abstract: Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs)…

26
arXiv — NLP / Computation & Language research 1d ago

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

arXiv:2606.30943v1 Announce Type: new Abstract: Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which affects international collaboration and the progress of…

8
arXiv — NLP / Computation & Language research 1d ago

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

arXiv:2606.31039v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such as logical fallacies remains underexplored. Prior work has primarily examined whether LLMs can…

12
arXiv — NLP / Computation & Language research 1d ago

A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases

arXiv:2606.31041v1 Announce Type: new Abstract: Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names,…

12
arXiv — NLP / Computation & Language research 1d ago

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

arXiv:2606.31112v1 Announce Type: new Abstract: ASR systems have been often reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including…

31
arXiv — NLP / Computation & Language research 1d ago

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

arXiv:2606.31432v1 Announce Type: new Abstract: Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a…

33
arXiv — NLP / Computation & Language research 1d ago

Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

arXiv:2606.31446v1 Announce Type: new Abstract: RVL-CDIP is a popular dataset for benchmarking document classifiers. However, the dataset contains ample amounts of label errors as well as non-trivial amounts of test-train overlap, both of which may impact model performance…

25
arXiv — NLP / Computation & Language research 1d ago

FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

arXiv:2606.31522v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision…

19
arXiv — NLP / Computation & Language research 1d ago

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations…

37
arXiv — NLP / Computation & Language research 1d ago

STEB: Style Text Embedding Benchmark

arXiv:2606.31741v1 Announce Type: new Abstract: While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap,…

27
arXiv — NLP / Computation & Language research 1d ago

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper…

25
arXiv — NLP / Computation & Language research 1d ago

LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish

arXiv:2606.31947v1 Announce Type: new Abstract: State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we…

25
arXiv — NLP / Computation & Language research 1d ago

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

arXiv:2606.31002v1 Announce Type: cross Abstract: Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean…

35
arXiv — NLP / Computation & Language research 1d ago

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite…

29
arXiv — NLP / Computation & Language research 1d ago

CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

arXiv:2606.31435v1 Announce Type: cross Abstract: Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or…

38
arXiv — NLP / Computation & Language research 1d ago

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

arXiv:2606.31543v1 Announce Type: cross Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I…

4
r/LocalLLaMA community 1d ago

[audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml

I’m the author of audio.cpp, a C++/ggml runtime for local audio models. I just added VibeVoice 1.5B support and wanted to share the benchmark because long-form multi-speaker TTS is a good stress test for local inference runtimes. Result on RTX 5090: VibeVoice 1.5B Audio length:…

26
Hugging Face Daily Papers research 1d ago

One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Abstract InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost. Generated by…

16

SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

Timesynth: A Temporal Fidelity Framework for Health Signal Digital Twins

MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules

Interpretable vs Learned Encoders for High-Cardinality Fraud Detection

Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling

Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth

Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR

YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

AGC-Bench: Measuring Artificial General Creativity

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3

Making Optimization Work When Labels Are Scarce [R]

Senior SWE Bench: a new benchmark focussed on realistically underspecified feature tasks

My reasons to run local models

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM ? Any tests or benchmarks ?

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

The gap between closed and open models might be much smaller than commonly assumed, because we don’t know what closed model providers do *in addition to* model inference

SWE-rebench leaderboard update: GLM-5.2, Qwen3.6-27B, Qwen3.6-35B-A3B, Gemma 4 31B and more + improved UI

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Xiaomi-GUI-0 Technical Report

Accelerometry-Derived Digital Biomarkers for Cardiometabolic Risk: A Population-Representative Tabular Benchmark with Uncertainty Quantification

PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

Probing Memorization of Tabular In-Context Learning

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

STEB: Style Text Embedding Benchmark

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

[audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml

One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

The gap between closed and open models might be much smaller than commonly assumed, because we don’t know what closed model providers do in addition to model inference