Tag: Benchmark · 500 articles archived under #benchmark

Hugging Face Daily Papers · research · 20d ago · EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge · Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search… 26

Hugging Face Daily Papers · research · 20d ago · EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments · Abstract EvoArena benchmark and EvoMem memory paradigm address the challenge of dynamic environments in LLM agents by modeling progressive updates and structured memory evolution, showing improved performance on evolving tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large… 5

Hugging Face Daily Papers · research · 20d ago · WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces · Abstract WeaveBench presents a comprehensive benchmark for evaluating computer-use agents across multiple interfaces, revealing significant challenges in long-horizon task orchestration and highlighting the limitations of traditional performance assessment methods. Generated by… 38

Hugging Face Daily Papers · research · 20d ago · InterleaveThinker: Reinforcing Agentic Interleaved Generation · Abstract InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks.
Generated by… 36

r/LocalLLaMA · community · 21d ago · New models released: Nex-N2 Pro 397B and Nex-N2 Mini 35B · They are FTs of Qwen3.5 and the benchmarks look pretty good https://huggingface.co/nex-agi/Nex-N2-mini https://huggingface.co/nex-agi/Nex-N2-Pro submitted by /u/1ncehost 23

r/LocalLLaMA · community · 21d ago · DiffusionGemma under real workloads feels very different from benchmark demos · okay after testing DiffusionGemma a bit more internally we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol but one thing that stood out REALLY fast for us was how different the H100 vs A100… 29

Hugging Face Daily Papers · research · 21d ago · τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems · Abstract A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models. Generated by… 6

Smol AI News · news-outlet · 21d ago · not much happened today · **Anthropic** reversed its covert degradation policy on **Claude Fable 5** after public backlash, sparking debates on governance, transparency, and access to frontier AI models. The model shows strong capabilities with mixed benchmark results, including **87.8% on WeirdML** and… 19

Smol AI News · news-outlet · 21d ago · not much happened today · **Anthropic's Fable/Mythos export-control crisis** dominates AI news, highlighting the intersection of **national security** and frontier model access.
Technical voices like **François Chollet** criticize opaque regulatory actions and advocate for **standardized benchmarks for… 6

Hugging Face Daily Papers · research · 21d ago · Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks · Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 16

arXiv — Machine Learning · research · 21d ago · GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction · arXiv:2606.11382v1 Announce Type: new Abstract: Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases,… 20

arXiv — NLP / Computation & Language · research · 21d ago · GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs · arXiv:2606.11562v1 Announce Type: cross Abstract: Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node… 37

arXiv — Machine Learning · research · 21d ago · Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics · arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition.
This raises a central evaluation and interpretability question: when a foundation-style… 29

arXiv — Machine Learning · research · 21d ago · Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks · arXiv:2606.12344v1 Announce Type: new Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch,… 27

arXiv — NLP / Computation & Language · research · 21d ago · Benchmarking Large Language Models for Safety Data Extraction · arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks… 27

arXiv — NLP / Computation & Language · research · 21d ago · BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts · arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting… 29

arXiv — NLP / Computation & Language · research · 21d ago · Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs · arXiv:2606.11232v1 Announce Type: new Abstract: Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete.
Realistic judgments often require a model to combine several moral signals within the same… 14

arXiv — NLP / Computation & Language · research · 21d ago · Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite · arXiv:2606.11257v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use,… 35

arXiv — NLP / Computation & Language · research · 21d ago · Agent Skill Evaluation and Evolution: Frameworks and Benchmarks · arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20

arXiv — NLP / Computation & Language · research · 21d ago · AI Coding Agents Can Reproduce Social Science Findings · arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks… 8

arXiv — NLP / Computation & Language · research · 21d ago · Can AI Reason Like an Urban Planner?
Benchmarking Large Language Models Against Professional Judgment · arXiv:2606.11678v1 Announce Type: new Abstract: Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment?… 12

arXiv — NLP / Computation & Language · research · 21d ago · Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation · arXiv:2606.12117v1 Announce Type: new Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the… 27

Hugging Face Daily Papers · research · 21d ago · ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics · Abstract A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 37

r/LocalLLaMA · community · 21d ago · How can Deepseek v4 top the coding leaderboards and still sit 8 months behind the frontier? · Two numbers on this model that don't sit comfortably with each other. The Pro config posts coding scores near the top of every board, 80.6 on SWE-bench Verified and 93.5 on LiveCodeBench.
Then CAISI ran it across a spread of domains and landed on it being roughly eight months… 20

Hugging Face Daily Papers · research · 21d ago · TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders · Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard… 6

r/LocalLLaMA · community · 21d ago · I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3) · I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the… 12

Hugging Face Daily Papers · research · 21d ago · Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models · Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach. Generated by… 35

Hugging Face Daily Papers · research · 21d ago · InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning · Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 18

r/LocalLLaMA · community · 21d ago · Tried to benchmark Google’s new on-device dictation models (Eloquent) and basically couldn’t · I tried to benchmark Google’s new on-device dictation app (Eloquent) and basically couldn’t.
It drops about half of my dictations. tl;dr Full results are 👉 here. Background: Google shipped a new fully‑local dictation app yesterday with proprietary new models, so I was excited… 5

Hugging Face Daily Papers · research · 22d ago · SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction · Abstract SkillHarm is a benchmark for evaluating skill-based attacks across the skill-use lifecycle, demonstrating significant vulnerabilities in current agents with attack success rates up to 86.3%. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agent skills occupy a privileged… 36

r/LocalLLaMA · community · 22d ago · SenseNova U1 dropped an infographic-specific finetune · it's the same U1-8B-MoT base with an extended MT (multi-task) training phase focused on structured visual output. the benchmark jumps are significant: IGenBench I-ACC (infographic accuracy): 4.2👉17.0 (4x); Chart Understanding: 51.3👉69.5; Text Rendering: 39.8👉46.6; Overall… 32

r/LocalLLaMA · community · 22d ago · 1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM · Bonsai LM (1-bit and 1.58-bit LLMs) benchmark on Jetson Orin Nano Super. Just released a deep benchmark of 5 Bonsai LM models (1.7B → ~8B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN. A thread! So, Bonsai LM models… 29

r/LocalLLaMA · community · 22d ago · Cohere released North Mini Code: Its first Open-Source Agentic Coding Model · Small: 30 billion parameters, 3B active. Efficient: Benchmarks to 33.4 on the Artificial Analysis Coding Index, competitive among similar sized models. Open Source: Apache 2.0 license. HF: https://huggingface.co/CohereLabs/North-Mini-Code-1.0 submitted by … 8

r/MachineLearning · community · 22d ago · Introducing Papers Without Code [P] · Hi, Niels here from the open-source team at Hugging Face. I've recently relaunched paperswithcode.co as a source for finding the state of the art (SOTA) across various AI domains, from 3D generation to AI agents.
This is done by automatically parsing research papers published on… 36

Hugging Face Daily Papers · research · 22d ago · MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism · Abstract MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead. Generated by… 33

LangChain releases · dev-tools · 22d ago · langchain-groq==1.1.3 · Changes since langchain-groq==1.1.2: release(groq): 1.1.3 (#38009); hotfix(openai): min core dep (#37990); test(langchain,partners): disable pytest-benchmark under xdist to silence PytestBenchmarkWarning (#37901); chore(model-profiles): refresh model profile data (#37726)… 10

Hugging Face Daily Papers · research · 22d ago · WorldOlympiad: Can Your World Model Survive a Triathlon? · Abstract WorldOlympiad presents a comprehensive benchmark for evaluating video-based world models across physical faithfulness, geometric consistency, and interaction fidelity, revealing significant gaps in current generative models' capabilities. Generated by… 13

arXiv — Machine Learning · research · 22d ago · From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents · arXiv:2606.09863v1 Announce Type: new Abstract: LLM agents can fail silently by asserting task completion when the environment state shows otherwise. We study this failure mode, false success, across two agent benchmarks: 9,876 tau2-bench trajectories from 8 model families and… 13

arXiv — Machine Learning · research · 22d ago · FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses · arXiv:2606.09878v1 Announce Type: new Abstract: Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks.
We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model… 20

arXiv — NLP / Computation & Language · research · 22d ago · PreAct-Bench: Benchmarking Predictive Monitoring in LLMs · arXiv:2606.09890v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior… 17

arXiv — Machine Learning · research · 22d ago · Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark · arXiv:2606.10084v1 Announce Type: new Abstract: This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction across twelve hidden scores and five scenario families: clean forecasting, noisy… 9

arXiv — Machine Learning · research · 22d ago · MMClima: A Framework for Multimodal Climate Science Data and Evaluation · arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models.
We… 20

arXiv — Machine Learning · research · 22d ago · When Design Rules Break: Benchmark Composition Determines Whether Label Informativeness Predicts GNN Aggregator Choice · arXiv:2606.10249v1 Announce Type: new Abstract: We examine whether graph neural network (GNN) design rules generalize across benchmark families by studying aggregator selection (sum, mean, max) on 24 node-classification datasets spanning citation, heterophilic, LINKX… 22

arXiv — NLP / Computation & Language · research · 22d ago · When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking · arXiv:2606.10287v1 Announce Type: cross Abstract: Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits@k, and Mean Rank, which often produce conflicting model orderings across… 6

arXiv — NLP / Computation & Language · research · 22d ago · BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts · arXiv:2606.10061v1 Announce Type: new Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research… 27

arXiv — NLP / Computation & Language · research · 22d ago · Do Vision-Language Models See or Guess?
Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark · arXiv:2606.10400v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than… 23

arXiv — NLP / Computation & Language · research · 22d ago · KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty · arXiv:2606.10403v1 Announce Type: new Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung)… 34

arXiv — NLP / Computation & Language · research · 22d ago · LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake · arXiv:2606.10460v1 Announce Type: new Abstract: Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired… 20

arXiv — NLP / Computation & Language · research · 22d ago · Benchmarking Knowledge Editing using Logical Rules · arXiv:2606.10554v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are… 15

arXiv — NLP / Computation & Language · research · 22d ago · Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval · arXiv:2606.10657v1 Announce Type: new Abstract: Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable.
Specifically, standard scores are highly sensitive to the exact… 10