Tag: Funding · 500 articles archived under #funding

Hugging Face Daily Papers research 24d ago SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated… 30

Hugging Face Daily Papers research 24d ago Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search systems iteratively interact… 34

arXiv — Machine Learning research 24d ago Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know.
Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly… 27

arXiv — Machine Learning research 24d ago MacArena: Benchmarking Computer Use Agents on an Online macOS Environment arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,… 37

arXiv — NLP / Computation & Language research 24d ago RECAP: Regression Evaluation for Continual Adaptation of Prompts arXiv:2606.06698v1 Announce Type: cross Abstract: Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure… 38

arXiv — Machine Learning research 24d ago Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices arXiv:2606.07068v1 Announce Type: new Abstract: Background: Since 1990 many feature selection methods have been proposed across heterogeneous applications.
To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing… 32

arXiv — Machine Learning research 24d ago REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or… 12

arXiv — Machine Learning research 24d ago Decision-Aware Evaluation of Physics-Informed Surrogates arXiv:2606.07146v1 Announce Type: new Abstract: Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open… 22

arXiv — NLP / Computation & Language research 24d ago Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests arXiv:2606.07379v1 Announce Type: cross Abstract: A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores… 4

arXiv — NLP / Computation & Language research 24d ago Re-Centering Humans in LLM Personalization arXiv:2606.06614v1 Announce Type: new Abstract: Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users.
In this paper,… 9

arXiv — NLP / Computation & Language research 24d ago UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in… 33

arXiv — NLP / Computation & Language research 24d ago Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single,… 19

arXiv — NLP / Computation & Language research 24d ago OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs).
However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of… 5

arXiv — NLP / Computation & Language research 24d ago MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.… 19

arXiv — NLP / Computation & Language research 24d ago Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling arXiv:2606.07040v1 Announce Type: new Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query,… 20

arXiv — NLP / Computation & Language research 24d ago UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We… 37

arXiv — NLP / Computation & Language research 24d ago From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness.
We argue that correctness is a useful but indirect proxy for the effect… 21

arXiv — NLP / Computation & Language research 24d ago Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation arXiv:2606.07057v1 Announce Type: cross Abstract: Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which… 34

arXiv — NLP / Computation & Language research 24d ago MMAE: A Massive Multitask Audio Editing Benchmark arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,… 8

arXiv — NLP / Computation & Language research 24d ago Reference-Free Evaluation of Taxonomies arXiv:2505.11470v3 Announce Type: replace Abstract: We introduce two reference-free metrics for quality evaluation of taxonomies in the absence of labels. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, addressing… 31

arXiv — NLP / Computation & Language research 24d ago SWE-IF: Aligning Code Evaluation with Human Preference arXiv:2510.07315v2 Announce Type: replace Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human… 14

Hugging Face Daily Papers research 27d ago SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models… 38

Hugging Face Daily Papers research 27d ago Benchmark Everything Everywhere All at Once Abstract Automated benchmark creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Benchmarks are fundamental for evaluating and advancing… 27

Hugging Face Daily Papers research 27d ago LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs Abstract PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets. Generated by… 15

arXiv — Machine Learning research 27d ago The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models arXiv:2606.05169v1 Announce Type: new Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by… 30

arXiv — Machine Learning research 27d ago Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference arXiv:2606.05308v1 Announce Type: new Abstract: With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the… 25

arXiv — Machine Learning research 27d ago Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation arXiv:2606.05403v1 Announce Type: new Abstract: Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions.
Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation,… 4

arXiv — Machine Learning research 27d ago Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents arXiv:2606.05558v1 Announce Type: new Abstract: Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation… 4

arXiv — Machine Learning research 27d ago Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions arXiv:2606.05692v1 Announce Type: new Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on… 35

arXiv — Machine Learning research 27d ago Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data arXiv:2606.05781v1 Announce Type: new Abstract: Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks often incurs substantial latency, cost, and data privacy overhead. We present a hybrid framework that combines a fine-tuned small… 34

arXiv — Machine Learning research 27d ago GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis arXiv:2606.05860v1 Announce Type: new Abstract: Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise.
Traditional Automated Machine Learning (AutoML) systems typically… 21

arXiv — NLP / Computation & Language research 27d ago PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis arXiv:2606.05176v1 Announce Type: new Abstract: While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In… 20

arXiv — NLP / Computation & Language research 27d ago ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces arXiv:2606.05402v1 Announce Type: new Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a… 30

arXiv — NLP / Computation & Language research 27d ago TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not… 32

arXiv — NLP / Computation & Language research 27d ago Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored.
Stochasticity is essential in scenarios… 23

arXiv — NLP / Computation & Language research 27d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a… 9

arXiv — NLP / Computation & Language research 27d ago Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an… 13

arXiv — NLP / Computation & Language research 27d ago Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery arXiv:2606.06267v1 Announce Type: new Abstract: Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by… 23

arXiv — NLP / Computation & Language research 27d ago LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs arXiv:2606.06286v1 Announce Type: new Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use.
We introduce PropMe, a… 26

arXiv — NLP / Computation & Language research 27d ago A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation arXiv:2606.06420v1 Announce Type: new Abstract: We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs… 38

The Information — AI news-outlet 27d ago Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, a level that would make it one of the most valuable privately held data center operators, The Information reported late Thursday. Brookfield Asset Management, KKR and… 28

The Information — AI news-outlet 27d ago Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, as it seeks to capitalize on soaring demand for the infrastructure needed to support artificial intelligence, according to people with knowledge of the deal. Brookfield… 34

Hugging Face Daily Papers research 28d ago Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game Abstract Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations. Generated by… 7

r/LocalLLaMA community 28d ago I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that.
I built a docs-first repo on practical LLM system design patterns, covering pre-filtering, hybrid retrieval, rerankers, in-memory… 23

The Information — AI news-outlet 28d ago Fusion Startup Helion Nearly Triples Valuation to $15.5 Billion in Thrive-led Round Helion Energy, a nuclear fusion startup backed by OpenAI’s Sam Altman, still has to prove it can produce electricity to serve data centers and other customers. But investors seem confident it can deliver. The Everett, Wash.–based company said it has raised $465 million in… 33

Hugging Face Daily Papers research 28d ago PaintBench: Deterministic Evaluation of Precise Visual Editing Abstract PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While current… 12

Hugging Face Daily Papers research 28d ago Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems Abstract Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents are rapidly evolving from coding assistants… 21

arXiv — Machine Learning research 28d ago TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection arXiv:2606.04073v1 Announce Type: new Abstract: This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series… 33

arXiv — Machine Learning research 28d ago Variance Reduction for Heavy-Tailed Monetization Metrics in Ranking Experiments via Post-Stratification arXiv:2606.04110v1 Announce Type: new Abstract: Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both… 18

arXiv — Machine Learning research 28d ago KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not… 8

Page 7 of 10 · 500 articles