Tag

Benchmark

500 articles archived under #benchmark

Hugging Face Daily Papers research 20d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search…

26
Hugging Face Daily Papers research 20d ago

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Abstract EvoArena benchmark and EvoMem memory paradigm address the challenge of dynamic environments in LLM agents by modeling progressive updates and structured memory evolution, showing improved performance on evolving tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large…

5
Hugging Face Daily Papers research 20d ago

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Abstract WeaveBench presents a comprehensive benchmark for evaluating computer-use agents across multiple interfaces, revealing significant challenges in long-horizon task orchestration and highlighting the limitations of traditional performance assessment methods. Generated by…

38
Hugging Face Daily Papers research 20d ago

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Abstract InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks. Generated by…

36
r/LocalLLaMA community 21d ago

New models released: Nex-N2 Pro 397B and Nex-N2 Mini 35B

They are FTs of Qwen3.5 and the benchmarks look pretty good: https://huggingface.co/nex-agi/Nex-N2-mini https://huggingface.co/nex-agi/Nex-N2-Pro (submitted by /u/1ncehost)

23
r/LocalLLaMA community 21d ago

DiffusionGemma under real workloads feels very different from benchmark demos

okay after testing DiffusionGemma a bit more internally we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol but one thing that stood out REALLY fast for us was how different the H100 vs A100…

29
Hugging Face Daily Papers research 21d ago

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Abstract A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models. Generated by…

6
Smol AI News news-outlet 21d ago

not much happened today

**Anthropic** reversed its covert degradation policy on **Claude Fable 5** after public backlash, sparking debates on governance, transparency, and access to frontier AI models. The model shows strong capabilities with mixed benchmark results, including **87.8% on WeirdML** and…

19
Smol AI News news-outlet 21d ago

not much happened today

**Anthropic's Fable/Mythos export-control crisis** dominates AI news, highlighting the intersection of **national security** and frontier model access. Technical voices like **François Chollet** criticize opaque regulatory actions and advocate for **standardized benchmarks for…

6
Hugging Face Daily Papers research 21d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

16
arXiv — Machine Learning research 21d ago

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

arXiv:2606.11382v1 Announce Type: new Abstract: Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases,…

20
arXiv — NLP / Computation & Language research 21d ago

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

arXiv:2606.11562v1 Announce Type: cross Abstract: Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node…

37
arXiv — Machine Learning research 21d ago

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style…

29
arXiv — Machine Learning research 21d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

arXiv:2606.12344v1 Announce Type: new Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch,…

27
arXiv — NLP / Computation & Language research 21d ago

Benchmarking Large Language Models for Safety Data Extraction

arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks…

27
arXiv — NLP / Computation & Language research 21d ago

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting…

29
arXiv — NLP / Computation & Language research 21d ago

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

arXiv:2606.11232v1 Announce Type: new Abstract: Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same…

14
arXiv — NLP / Computation & Language research 21d ago

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

arXiv:2606.11257v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use,…

35
arXiv — NLP / Computation & Language research 21d ago

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in…

20
arXiv — NLP / Computation & Language research 21d ago

AI Coding Agents Can Reproduce Social Science Findings

arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks…

8
arXiv — NLP / Computation & Language research 21d ago

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

arXiv:2606.11678v1 Announce Type: new Abstract: Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment?…

12
arXiv — NLP / Computation & Language research 21d ago

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

arXiv:2606.12117v1 Announce Type: new Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the…

27
Hugging Face Daily Papers research 21d ago

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

37
r/LocalLLaMA community 21d ago

How can Deepseek v4 top the coding leaderboards and still sit 8 months behind the frontier?

Two numbers on this model that don't sit comfortably with each other. The Pro config posts coding scores near the top of every board, 80.6 on SWE-bench Verified and 93.5 on LiveCodeBench. Then CAISI ran it across a spread of domains and landed on it being roughly eight months…

20
Hugging Face Daily Papers research 21d ago

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard…

6
r/LocalLLaMA community 21d ago

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the…

12
Hugging Face Daily Papers research 21d ago

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach. Generated by…

35
Hugging Face Daily Papers research 21d ago

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

18
r/LocalLLaMA community 21d ago

Tried to benchmark Google’s new on-device dictation models (Eloquent) and basically couldn’t

I tried to benchmark Google’s new on-device dictation app (Eloquent) and basically couldn’t. It drops about half of my dictations. tl;dr: Full results are 👉 here. Background: Google shipped a new fully‑local dictation app yesterday with proprietary new models, so I was excited…

5
Hugging Face Daily Papers research 22d ago

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Abstract SkillHarm is a benchmark for evaluating skill-based attacks across the skill-use lifecycle, demonstrating significant vulnerabilities in current agents with attack success rates up to 86.3%. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agent skills occupy a privileged…

36
r/LocalLLaMA community 22d ago

SenseNova U1 dropped an infographic-specific finetune

it's the same U1-8B-MoT base with an extended MT (multi-task) training phase focused on structured visual output. the benchmark jumps are significant: IGenBench I-ACC (infographic accuracy): 4.2 👉 17.0 (4x); Chart Understanding: 51.3 👉 69.5; Text Rendering: 39.8 👉 46.6; Overall…

32
r/LocalLLaMA community 22d ago

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

Bonsai LM (1-bit and 1.58-bit LLMs) benchmark on Jetson Orin Nano Super. Just released a deep benchmark of 5 Bonsai LM models (1.7B → ~8B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN. A thread! So, Bonsai LM models…

29
r/LocalLLaMA community 22d ago

Cohere released North Mini Code: Its first Open-Source Agentic Coding Model

Small: 30 billion parameters, 3B active. Efficient: Benchmarks to 33.4 on the Artificial Analysis Coding Index, competitive among similarly sized models. Open Source: Apache 2.0 license. HF: https://huggingface.co/CohereLabs/North-Mini-Code-1.0 …

8
r/MachineLearning community 22d ago

Introducing Papers Without Code [P]

Hi, Niels here from the open-source team at Hugging Face. I've recently relaunched paperswithcode.co as a source for finding the state of the art (SOTA) across various AI domains, from 3D generation to AI agents. This is done by automatically parsing research papers published on…

36
Hugging Face Daily Papers research 22d ago

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Abstract MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead. Generated by…

33
LangChain releases dev-tools 22d ago

langchain-groq==1.1.3

Changes since langchain-groq==1.1.2: release(groq): 1.1.3 (#38009); hotfix(openai): min core dep (#37990); test(langchain,partners): disable pytest-benchmark under xdist to silence PytestBenchmarkWarning (#37901); chore(model-profiles): refresh model profile data (#37726) …

10
Hugging Face Daily Papers research 22d ago

WorldOlympiad: Can Your World Model Survive a Triathlon?

Abstract WorldOlympiad presents a comprehensive benchmark for evaluating video-based world models across physical faithfulness, geometric consistency, and interaction fidelity, revealing significant gaps in current generative models' capabilities. Generated by…

13
arXiv — Machine Learning research 22d ago

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

arXiv:2606.09863v1 Announce Type: new Abstract: LLM agents can fail silently by asserting task completion when the environment state shows otherwise. We study this failure mode, false success, across two agent benchmarks: 9,876 tau2-bench trajectories from 8 model families and…

13
arXiv — Machine Learning research 22d ago

FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses

arXiv:2606.09878v1 Announce Type: new Abstract: Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model…

20
arXiv — NLP / Computation & Language research 22d ago

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

arXiv:2606.09890v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior…

17
arXiv — Machine Learning research 22d ago

Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark

arXiv:2606.10084v1 Announce Type: new Abstract: This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction across twelve hidden scores and five scenario families: clean forecasting, noisy…

9
arXiv — Machine Learning research 22d ago

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We…

20
arXiv — Machine Learning research 22d ago

When Design Rules Break: Benchmark Composition Determines Whether Label Informativeness Predicts GNN Aggregator Choice

arXiv:2606.10249v1 Announce Type: new Abstract: We examine whether graph neural network (GNN) design rules generalize across benchmark families by studying aggregator selection (sum, mean, max) on 24 node-classification datasets spanning citation, heterophilic, LINKX…

22
arXiv — NLP / Computation & Language research 22d ago

When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

arXiv:2606.10287v1 Announce Type: cross Abstract: Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits$@$k, and Mean Rank, which often produce conflicting model orderings across…

6
arXiv — NLP / Computation & Language research 22d ago

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

arXiv:2606.10061v1 Announce Type: new Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research…

27
arXiv — NLP / Computation & Language research 22d ago

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

arXiv:2606.10400v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than…

23
arXiv — NLP / Computation & Language research 22d ago

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

arXiv:2606.10403v1 Announce Type: new Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung)…

34
arXiv — NLP / Computation & Language research 22d ago

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

arXiv:2606.10460v1 Announce Type: new Abstract: Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired…

20
arXiv — NLP / Computation & Language research 22d ago

Benchmarking Knowledge Editing using Logical Rules

arXiv:2606.10554v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are…

15
arXiv — NLP / Computation & Language research 22d ago

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

arXiv:2606.10657v1 Announce Type: new Abstract: Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact…

10
