Tag: Benchmark
500 articles archived under #benchmark
r/LocalLLaMA community 10d ago Leaderboard for quantized models, similar to artificial analysis? Artificial analysis’ leaderboard for models is somewhat useful for comparing model intelligence, but does not take into account quantization for open models. Is there a way to better compare quantized open models against each other and proprietary models other than running them… 35
Hugging Face Daily Papers research 10d ago WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents Abstract WorldLines benchmark evaluates long-term memory in embodied agents through household scenarios, while ObsMem framework addresses challenges in partial observability and memory translation for decision-making. Generated by Qwen/Qwen2.5-Coder-32B-Instruct To assist humans… 19
r/LocalLLaMA community 10d ago Best local model for vision - 2nd benchmark update - 21 Jun 2026 I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it… 9
r/LocalLLaMA community 11d ago GLM-5.2 benchmarked on DeepSWE: Beats Gemini & GPT-5.4, but the token volume/cost makes it wildly inefficient? (Theo - t3.gg) Saw this breakdown from Theo (t3.gg) on X showing the latest DeepSWE leaderboard stats for the new GLM-5.2 open-weight model. The good news: it's officially surpassing GPT-5.4 and the entire Gemini lineup in raw coding capability.
Seeing an open-weight model punch that high is… 15
r/LocalLLaMA community 12d ago Some llama.cpp B70 SYCL benchmarks build: dd4623a74 (9640) | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | gemma4 12B Q8_0 | 11.78 GiB | 11.91 B | SYCL | -1 | pp512 | 1578.19 ± 7.82 |… 11
r/LocalLLaMA community 12d ago I benchmarked Claude's "Fast C++". It wasn't faster submitted by /u/User_Deprecated 15
Hugging Face Daily Papers research 13d ago Context-Aware RL for Agentic and Multimodal LLMs Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by… 21
Hugging Face Daily Papers research 13d ago The Data Manifold under the Microscope Abstract A benchmarking framework is introduced to study data-manifold geometry by extending dSprites and COIL-20 datasets with additional transformation dimensions and dense sampling, enabling accurate estimation of curvature, reach, and volume for theoretical analysis and… 36
r/LocalLLaMA community 13d ago Benchmarking or benchmarketing? Maybe I’m getting cynical, but LLM benchmarking is starting to feel less like measurement and more like marketing and positioning. Every week there’s a new leaderboard score, new chart, new eval suite, or some claim that a model is suddenly the best. It feels like benchmarks… 35
r/LocalLLaMA community 13d ago New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts You can read about it here: https://artificialanalysis.ai/articles/aa-briefcase This is a solid benchmark from Artificial Analysis. It basically tests an LLM's ability to plan and execute tasks.
And more importantly, it is a new benchmark that is not saturated, so no one can… 32
Hugging Face Daily Papers research 13d ago Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages Abstract Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 33
r/LocalLLaMA community 13d ago Has anyone here used VibeThinker-3B outside benchmarks? Just curious, given the hype and benchmark numbers. Curious about real-world behavior: debugging, coding assistance, reasoning over messy prompts, local latency, failure modes, and whether it actually feels useful versus just optimized for verifiable evals.… 23
Hugging Face Daily Papers research 13d ago No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages Abstract Research addresses code generation challenges for no-resource programming languages by developing benchmarks and proposing a method that combines further pre-training with weight difference transfer to create specialized instruction-following models at reduced… 27
r/LocalLLaMA community 13d ago Researchers trained a Deep Research agent with 32 H100s and open-sourced everything Ohio State University's NLP team released QUEST-35B, an open-source Deep Research agent trained using ~32 H100s and ~8K synthetic samples. The team open-sourced the training recipe, code, weights and datasets. Benchmark results show competitive performance against several… 13
Hugging Face Daily Papers research 13d ago JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines Abstract Game development frameworks and benchmarks were created using data from game jam competitions to evaluate code generation and project-level programming capabilities.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Current AI-driven game development has made substantial… 25
Hugging Face Daily Papers research 13d ago DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis Abstract A large-scale real-world dataset called DF3DV-1K is introduced to address the lack of clean and cluttered image sets for distractor-free radiance field research, containing 1,048 scenes with 89,924 images across 128 distractor types and 161 scene themes, along with a… 5
arXiv — NLP / Computation & Language research 13d ago Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment arXiv:2606.19558v1 Announce Type: cross Abstract: Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a… 32
arXiv — Machine Learning research 13d ago IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for… 35
arXiv — Machine Learning research 13d ago MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery arXiv:2606.19624v1 Announce Type: new Abstract: Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the… 16
arXiv — Machine Learning research 13d ago Hard or Just Unreached?
Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation arXiv:2606.19636v1 Announce Type: new Abstract: Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic… 20
arXiv — Machine Learning research 13d ago Efficient Neural Network Model Selection for Few-Class Application Datasets arXiv:2606.19712v1 Announce Type: new Abstract: While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are… 29
arXiv — NLP / Computation & Language research 13d ago Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards arXiv:2606.19352v1 Announce Type: new Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented… 16
arXiv — NLP / Computation & Language research 13d ago LaViSA: A Language and Vision Structural Ambiguity Benchmark arXiv:2606.19552v1 Announce Type: new Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding.
Visual scenes serve as useful cues for resolving… 22
arXiv — NLP / Computation & Language research 13d ago REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection arXiv:2606.19881v1 Announce Type: new Abstract: Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector… 38
arXiv — NLP / Computation & Language research 13d ago The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for… 34
arXiv — NLP / Computation & Language research 13d ago Benchmarking Agentic Review Systems arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems… 15
arXiv — NLP / Computation & Language research 13d ago CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models arXiv:2606.19788v1 Announce Type: cross Abstract: We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models.
CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies,… 29
arXiv — NLP / Computation & Language research 13d ago JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines arXiv:2606.19830v1 Announce Type: cross Abstract: Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to… 6
arXiv — NLP / Computation & Language research 13d ago TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law arXiv:2507.00875v3 Announce Type: replace Abstract: Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology,… 38
arXiv — NLP / Computation & Language research 13d ago ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and… 22
Hugging Face Daily Papers research 13d ago FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines Abstract FAPO optimizes LLM pipelines by combining prompt editing with structural changes, demonstrating superior performance across multiple benchmarks and security tasks.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-step LLM pipelines fail through interactions among… 38
Hugging Face Daily Papers research 13d ago FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining Abstract FreeStyle is a scalable dual-reference generation framework that uses community LoRA mining to create large-scale style-content triplets while addressing content leakage through disentanglement mechanisms and a comprehensive benchmark. Generated by… 16
Hugging Face Daily Papers research 13d ago Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27
Hugging Face Daily Papers research 13d ago REVES: REvision and VErification--Augmented Training for Test-Time Scaling Abstract A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems. Generated by… 23
r/LocalLLaMA community 14d ago Cutting LLM Token Costs with rtk, headroom, and caveman - savings measured on real workloads rtk, headroom, and caveman keep showing up whenever someone posts about cutting their token bill 60-90%. I wanted to know what they save on an actual bill instead of a benchmark, so I replayed all three over my own Claude Code history.
My corpus was 500 of my own Claude Code… 11
r/MachineLearning community 14d ago Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D] I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with… 25
r/LocalLLaMA community 14d ago GLM-5.2 Is The Best Open Weight Creative Writing Model As Per Sam Paech's Creative Writing Benchmark on EQ Bench: https://eqbench.com/creative_writing.html submitted by /u/Few_Painter_5588 24
Hugging Face Daily Papers research 14d ago MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction Abstract 3D point motion forecasting model predicts object trajectories from visual history and language goals, demonstrating superior performance on benchmarks and transferring effectively to robot manipulation and video generation tasks. Generated by… 4
Hugging Face Daily Papers research 14d ago iOSWorld: A Benchmark for Personally Intelligent Phone Agents Abstract iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be… 6
Hugging Face Daily Papers research 14d ago MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and… 29
Hugging Face Daily Papers research 14d ago A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets Abstract A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Predictive code… 17
r/LocalLLaMA community 14d ago Le Chaton Fat Flash local when? We are very happy with Le Chaton Fat SOTA but most of us would like to run it locally. You know, for privacy and sovereignty reasons. Does anyone have any updates when a local "flash" or "small" version is available? submitted by /u/corpo_monkey… 31
arXiv — Machine Learning research 14d ago ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule… 23
arXiv — Machine Learning research 14d ago Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes.
We introduce regime-stratified evaluation and apply it to three TSFMs on… 19
arXiv — Machine Learning research 14d ago TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under… 7
arXiv — Machine Learning research 14d ago MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes arXiv:2606.18640v1 Announce Type: new Abstract: Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized… 37
arXiv — NLP / Computation & Language research 14d ago GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory… 22
arXiv — Machine Learning research 14d ago A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI arXiv:2606.18970v1 Announce Type: new Abstract: Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains.
However,… 38
arXiv — Machine Learning research 14d ago Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts arXiv:2606.19036v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection… 16
arXiv — NLP / Computation & Language research 14d ago VISUALSKILL: Multimodal Skills for Computer-Use Agents arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the… 19