Tag

Funding

500 articles archived under #funding

Hugging Face Daily Papers research 24d ago

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated…

30
Hugging Face Daily Papers research 24d ago

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Agentic search systems iteratively interact…

34
arXiv — Machine Learning research 24d ago

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly…

27
arXiv — Machine Learning research 24d ago

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,…

37
arXiv — NLP / Computation & Language research 24d ago

RECAP: Regression Evaluation for Continual Adaptation of Prompts

arXiv:2606.06698v1 Announce Type: cross Abstract: Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure…

38
arXiv — Machine Learning research 24d ago

Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices

arXiv:2606.07068v1 Announce Type: new Abstract: Background: Since 1990 many feature selection methods have been proposed across heterogeneous applications. To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing…

32
arXiv — Machine Learning research 24d ago

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or…

12
arXiv — Machine Learning research 24d ago

Decision-Aware Evaluation of Physics-Informed Surrogates

arXiv:2606.07146v1 Announce Type: new Abstract: Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open…

22
arXiv — NLP / Computation & Language research 24d ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

arXiv:2606.07379v1 Announce Type: cross Abstract: A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores…

4
arXiv — NLP / Computation & Language research 24d ago

Re-Centering Humans in LLM Personalization

arXiv:2606.06614v1 Announce Type: new Abstract: Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper,…

9
arXiv — NLP / Computation & Language research 24d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in…

33
arXiv — NLP / Computation & Language research 24d ago

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single,…

19
arXiv — NLP / Computation & Language research 24d ago

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of…

5
arXiv — NLP / Computation & Language research 24d ago

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.…

19
arXiv — NLP / Computation & Language research 24d ago

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

arXiv:2606.07040v1 Announce Type: new Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query,…

20
arXiv — NLP / Computation & Language research 24d ago

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We…

37
arXiv — NLP / Computation & Language research 24d ago

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect…

21
arXiv — NLP / Computation & Language research 24d ago

Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation

arXiv:2606.07057v1 Announce Type: cross Abstract: Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which…

34
arXiv — NLP / Computation & Language research 24d ago

MMAE: A Massive Multitask Audio Editing Benchmark

arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,…

8
arXiv — NLP / Computation & Language research 24d ago

Reference-Free Evaluation of Taxonomies

arXiv:2505.11470v3 Announce Type: replace Abstract: We introduce two reference-free metrics for quality evaluation of taxonomies in the absence of labels. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, addressing…

31
arXiv — NLP / Computation & Language research 24d ago

SWE-IF: Aligning Code Evaluation with Human Preference

arXiv:2510.07315v2 Announce Type: replace Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human…

14
Hugging Face Daily Papers research 27d ago

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Large language models…

38
Hugging Face Daily Papers research 27d ago

Benchmark Everything Everywhere All at Once

Abstract Automated benchmark creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Benchmarks are fundamental for evaluating and advancing…

27
Hugging Face Daily Papers research 27d ago

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

Abstract PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets. Generated by…

15
arXiv — Machine Learning research 27d ago

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

arXiv:2606.05169v1 Announce Type: new Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by…

30
arXiv — Machine Learning research 27d ago

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

arXiv:2606.05308v1 Announce Type: new Abstract: With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the…

25
arXiv — Machine Learning research 27d ago

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

arXiv:2606.05403v1 Announce Type: new Abstract: Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation,…

4
arXiv — Machine Learning research 27d ago

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

arXiv:2606.05558v1 Announce Type: new Abstract: Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation…

4
arXiv — Machine Learning research 27d ago

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

arXiv:2606.05692v1 Announce Type: new Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on…

35
arXiv — Machine Learning research 27d ago

Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data

arXiv:2606.05781v1 Announce Type: new Abstract: Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks often incurs substantial latency, cost, and data privacy overhead. We present a hybrid framework that combines a fine-tuned small…

34
arXiv — Machine Learning research 27d ago

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

arXiv:2606.05860v1 Announce Type: new Abstract: Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically…

21
arXiv — NLP / Computation & Language research 27d ago

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

arXiv:2606.05176v1 Announce Type: new Abstract: While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In…

20
arXiv — NLP / Computation & Language research 27d ago

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

arXiv:2606.05402v1 Announce Type: new Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a…

30
arXiv — NLP / Computation & Language research 27d ago

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not…

32
arXiv — NLP / Computation & Language research 27d ago

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios…

23
arXiv — NLP / Computation & Language research 27d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a…

9
arXiv — NLP / Computation & Language research 27d ago

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an…

13
arXiv — NLP / Computation & Language research 27d ago

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

arXiv:2606.06267v1 Announce Type: new Abstract: Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by…

23
arXiv — NLP / Computation & Language research 27d ago

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

arXiv:2606.06286v1 Announce Type: new Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a…

26
arXiv — NLP / Computation & Language research 27d ago

A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

arXiv:2606.06420v1 Announce Type: new Abstract: We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs…

38
The Information — AI news-outlet 27d ago

Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation

Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, a level that would make it one of the most valuable privately held data center operators, The Information reported late Thursday. Brookfield Asset Management, KKR and…

28
The Information — AI news-outlet 27d ago

Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation

Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, as it seeks to capitalize on soaring demand for the infrastructure needed to support artificial intelligence, according to people with knowledge of the deal. Brookfield…

34
Hugging Face Daily Papers research 28d ago

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Abstract Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations. Generated by…

7
r/LocalLLaMA community 28d ago

I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation

If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that. I built a docs-first repo on practical LLM system design patterns, covering pre-filtering, hybrid retrieval, rerankers, in-memory…

23
The Information — AI news-outlet 28d ago

Fusion Startup Helion Nearly Triples Valuation to $15.5 Billion in Thrive-led Round

Helion Energy, a nuclear fusion startup backed by OpenAI’s Sam Altman, still has to prove it can produce electricity to serve data centers and other customers. But investors seem confident it can deliver. The Everett, Wash.–based company said it has raised $465 million in…

33
Hugging Face Daily Papers research 28d ago

PaintBench: Deterministic Evaluation of Precise Visual Editing

Abstract PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) While current…

12
Hugging Face Daily Papers research 28d ago

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Abstract Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) LLM agents are rapidly evolving from coding assistants…

21
arXiv — Machine Learning research 28d ago

TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection

arXiv:2606.04073v1 Announce Type: new Abstract: This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series…

33
arXiv — Machine Learning research 28d ago

Variance Reduction for Heavy-Tailed Monetization Metrics in Ranking Experiments via Post-Stratification

arXiv:2606.04110v1 Announce Type: new Abstract: Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both…

18
arXiv — Machine Learning research 28d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not…

8
