Tag: Benchmark · 500 articles archived under #benchmark
r/LocalLLaMA community 1h ago SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing. I’m pretty jaded like most of y’all. I don’t really get excited by new models much anymore. Last few weeks have been kinda meh to be honest. Monday, I stumbled upon SenseNova’s Mixture of Transformers models and they seem kinda like a different animal than other typical image… 4
r/LocalLLaMA community 2h ago [Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode I came across this interesting article (https://blog.exolabs.net/nvidia-dgx-spark/). I don't have the DGX Spark, but it made me curious: will this kind of arch speed up my setup for LLMs? Mac can host large models but the prefill speed sucks, so I tested it on my setup for… 25
arXiv — NLP / Computation & Language research 2h ago Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds arXiv:2607.00276v1 Announce Type: cross Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning… 10
arXiv — Machine Learning research 2h ago Timesynth: A Temporal Fidelity Framework for Health Signal Digital Twins arXiv:2607.00431v1 Announce Type: new Abstract: Forecasting models for health-signal digital twins must preserve the oscillatory, frequency, phase, and state-transition dynamics of physiological signals, yet the pointwise metrics used to benchmark them cannot detect when these… 7
arXiv — NLP / Computation & Language research 2h ago MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules arXiv:2607.00464v1 Announce Type: cross Abstract: Current
molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many… 22
arXiv — Machine Learning research 2h ago Interpretable vs Learned Encoders for High-Cardinality Fraud Detection arXiv:2607.00477v1 Announce Type: new Abstract: A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation… 7
arXiv — Machine Learning research 2h ago Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling arXiv:2607.01022v1 Announce Type: new Abstract: Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models,… 14
arXiv — NLP / Computation & Language research 2h ago Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth arXiv:2607.00139v1 Announce Type: new Abstract: The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only… 20
arXiv — NLP / Computation & Language research 2h ago Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting arXiv:2607.00159v1 Announce Type: new Abstract: Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence.
In practice, answer accuracy is… 30
arXiv — NLP / Computation & Language research 2h ago ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs arXiv:2607.00171v1 Announce Type: new Abstract: Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static, cover only a limited set of languages, are often domain-specific, susceptible to… 4
arXiv — NLP / Computation & Language research 2h ago LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR arXiv:2607.00250v1 Announce Type: new Abstract: Maltese has decent text corpora and pretrained language models, but, like many languages outside the handful with large OCR benchmarks, only a single known real labelled PDF corpus for OCR training, 57 pages, far below what… 25
arXiv — NLP / Computation & Language research 2h ago YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese arXiv:2607.00664v1 Announce Type: new Abstract: We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it… 8
arXiv — NLP / Computation & Language research 2h ago MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark arXiv:2607.00724v1 Announce Type: new Abstract: Multilingual fluency often invites a stronger assumption: a model that can speak a user's language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment.
To test this assumption… 8
arXiv — NLP / Computation & Language research 2h ago Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration… 12
arXiv — NLP / Computation & Language research 2h ago AGC-Bench: Measuring Artificial General Creativity arXiv:2607.01152v1 Announce Type: new Abstract: Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of… 10
arXiv — NLP / Computation & Language research 2h ago Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded… 14
arXiv — NLP / Computation & Language research 2h ago MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input.
Existing what-if tasks typically vary the observer while keeping the scene fixed.… 21
r/LocalLLaMA community 5h ago I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3 Just wanted to share that I was looking for an optimal way to run Ornith 35B in FP8 with E4M3 and MTP with vLLM but there was no out-of-the-box model with MTP drafter support. So I grafted this new model! It's 18% faster than without MTP and the drafter acceptance rate is not… 31
r/MachineLearning community 5h ago Making Optimization Work When Labels Are Scarce [R] https://www.gnosyslabs.com/case-studies/safety-classifier-sparse-labels Gnosys is an autonomous model engineer: it improves prompts and classifiers when ground truth is too sparse for conventional optimization. On ToxicChat, a public safety benchmark, under realistic label… 23
r/LocalLLaMA community 8h ago Senior SWE Bench: a new benchmark focussed on realistically underspecified feature tasks submitted by /u/jordo45 37
r/LocalLLaMA community 8h ago My reasons to run local models I can finetune any model on any dataset I want. I can use techniques like speculative decoding and other SOTA approaches to get the max tps. The LLM providers like Anthropic and OpenAI are not getting access to my data. The hardware is reusable for vision, text, speech, and I can run… 10
r/LocalLLaMA community 11h ago Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models..... Some backstory I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.… 16
r/LocalLLaMA community 12h ago Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM? Any tests or benchmarks?
Wondering which one would be better at speed / coding / reasoning submitted by /u/soyalemujica 32
Hugging Face Daily Papers research 12h ago SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions Abstract SWE-Interact presents a testbed that evaluates coding agents in realistic multi-turn, user-driven software engineering scenarios, revealing significant gaps between single-turn performance and interactive task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We… 6
r/LocalLLaMA community 14h ago The gap between closed and open models might be much smaller than commonly assumed, because we don’t know what closed model providers do *in addition to* model inference When Claude dominates GLM-5.2 in benchmarks, it’s usually assumed that Anthropic has superior model architectures, superior training pipelines, and other advanced machine learning techniques that make their models better than the competition. But actually, this doesn’t follow.… 10
r/LocalLLaMA community 15h ago SWE-rebench leaderboard update: GLM-5.2, Qwen3.6-27B, Qwen3.6-35B-A3B, Gemma 4 31B and more + improved UI Hi all, We made several updates to the SWE-rebench leaderboard: added new models, refreshed recent results, and reworked the leaderboard UI to make results easier to read, compare, and understand. New Models: Claude Opus 4.8 xhigh: 56.5% — 2.48M tokens GLM-5.2: 51.1% — 2.62M… 16
Hugging Face Daily Papers research 16h ago Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing Abstract A large-scale video editing dataset and model are introduced that support multi-task and structural manipulations through advanced data synthesis and network architectures.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing instruction-based video editing datasets… 38
Hugging Face Daily Papers research 21h ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages Abstract Multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech-capable models are increasingly deployed in real-world applications across… 36
Hugging Face Daily Papers research 21h ago Xiaomi-GUI-0 Technical Report Abstract A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface (GUI) agents build on… 7
arXiv — Machine Learning research 1d ago Accelerometry-Derived Digital Biomarkers for Cardiometabolic Risk: A Population-Representative Tabular Benchmark with Uncertainty Quantification arXiv:2606.30702v1 Announce Type: new Abstract: Structured tabular data dominates clinical medicine, yet existing benchmarks fail to reflect real-world properties like complex survey sampling, demographic oversampling, and subgroup fairness. We introduce the NHANES Accelerometry… 31
arXiv — Machine Learning research 1d ago PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely… 25
arXiv — Machine Learning research 1d ago Probing Memorization of Tabular In-Context Learning arXiv:2606.31208v1 Announce Type: new Abstract: Large tabular models (LTMs), i.e., tabular foundation models leveraging in-context learning (ICL), achieve state-of-the-art performance on tabular tasks.
While LLMs are known to unintentionally memorize training data, the… 19
arXiv — NLP / Computation & Language research 1d ago Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions arXiv:2606.30790v1 Announce Type: new Abstract: Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs)… 26
arXiv — NLP / Computation & Language research 1d ago Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer arXiv:2606.30943v1 Announce Type: new Abstract: Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which affects international collaboration and the progress of… 8
arXiv — NLP / Computation & Language research 1d ago Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies arXiv:2606.31039v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such as logical fallacies remains underexplored. Prior work has primarily examined whether LLMs can… 12
arXiv — NLP / Computation & Language research 1d ago A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases arXiv:2606.31041v1 Announce Type: new Abstract: Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names,… 12
arXiv — NLP / Computation & Language research 1d ago What Counts as an Error?
Dual-Reference Benchmarking for Atypical ASR arXiv:2606.31112v1 Announce Type: new Abstract: ASR systems have been often reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including… 31
arXiv — NLP / Computation & Language research 1d ago Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering arXiv:2606.31432v1 Announce Type: new Abstract: Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a… 33
arXiv — NLP / Computation & Language research 1d ago Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap arXiv:2606.31446v1 Announce Type: new Abstract: RVL-CDIP is a popular dataset for benchmarking document classifiers. However, the dataset contains ample amounts of label errors as well as non-trivial amounts of test-train overlap, both of which may impact model performance… 25
arXiv — NLP / Computation & Language research 1d ago FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents arXiv:2606.31522v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision… 19
arXiv — NLP / Computation & Language research 1d ago CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably.
A central risk is an evaluation illusion: fluent and well-structured explanations… 37
arXiv — NLP / Computation & Language research 1d ago STEB: Style Text Embedding Benchmark arXiv:2606.31741v1 Announce Type: new Abstract: While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap,… 27
arXiv — NLP / Computation & Language research 1d ago Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper… 25
arXiv — NLP / Computation & Language research 1d ago LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish arXiv:2606.31947v1 Announce Type: new Abstract: State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we… 25
arXiv — NLP / Computation & Language research 1d ago Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization arXiv:2606.31002v1 Announce Type: cross Abstract: Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself.
In this setting, compilation is only a validity check: a Lean… 35
arXiv — NLP / Computation & Language research 1d ago HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite… 29
arXiv — NLP / Computation & Language research 1d ago CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes arXiv:2606.31435v1 Announce Type: cross Abstract: Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or… 38
arXiv — NLP / Computation & Language research 1d ago Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2 arXiv:2606.31543v1 Announce Type: cross Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I… 4
r/LocalLLaMA community 1d ago [audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml I’m the author of audio.cpp, a C++/ggml runtime for local audio models. I just added VibeVoice 1.5B support and wanted to share the benchmark because long-form multi-speaker TTS is a good stress test for local inference runtimes.
Result on RTX 5090: VibeVoice 1.5B Audio length:… 26
Hugging Face Daily Papers research 1d ago One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding Abstract InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost. Generated by… 16