Tag

Funding

500 articles archived under #funding · RSS

arXiv — Machine Learning research 2h ago

TRIE: An Evaluation Framework for Stochastic PDE Surrogates

arXiv:2607.00196v1 Announce Type: new Abstract: Many scientific systems exhibit uncertainty from stochastic forcing, unresolved degrees of freedom, or imperfect observations, making reliable surrogate forecasting fundamentally distributional rather than pointwise. For such…

37
arXiv — NLP / Computation & Language research 2h ago

Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions

arXiv:2607.00304v1 Announce Type: cross Abstract: The bias-reliability tradeoff conjectures that LLM evaluation systems are constrained in (gamma, H, CV) space, where evaluator coupling (gamma), strategy diversity (H), and small-sample measurement reliability (CV(N)) cannot be…

7
arXiv — Machine Learning research 2h ago

Generative Refinement for Low-Budget Black-Box Optimization

arXiv:2607.00691v1 Announce Type: new Abstract: Black-box optimization is a fundamental science and engineering tool that makes it possible to optimize objectives without gradient information. Unfortunately, as it often requires many function evaluations, it can be challenging…

31
arXiv — Machine Learning research 2h ago

Detecting the Undetectable: Enhancing Unsupervised Time Series Anomaly Detection via Active Learning

arXiv:2607.00720v1 Announce Type: new Abstract: Despite the increasing sophistication of industrial AI systems, the ability to reliably detect subtle and noisy anomalies in complex time series data remains a critical yet unresolved challenge. In large-scale industrial…

13
arXiv — Machine Learning research 2h ago

Constrained Bayesian Optimisation with Multiple Information Sources

arXiv:2607.00865v1 Announce Type: new Abstract: Bayesian Optimisation (BO) under unknown constraints is particularly challenging when feasible regions are small. In such settings, existing methods that typically rely solely on evaluations of the true objective and constraints…

32
arXiv — Machine Learning research 2h ago

LeNEPA: No-Augmentation Next-Latent Prediction for Time-Series Representation Learning

arXiv:2607.00958v1 Announce Type: new Abstract: Time series are central to modern data mining applications, from industrial telemetry and server metrics to finance and physiology, yet time-series self-supervised learning often depends on view and augmentation choices that encode…

14
arXiv — NLP / Computation & Language research 2h ago

Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth

arXiv:2607.00139v1 Announce Type: new Abstract: The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only…

20
arXiv — NLP / Computation & Language research 2h ago

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

arXiv:2607.00171v1 Announce Type: new Abstract: Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static, cover only a limited set of languages, are often domain-specific, susceptible to…

4
arXiv — NLP / Computation & Language research 2h ago

SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework

arXiv:2607.00274v1 Announce Type: new Abstract: Effective writing feedback is among the strongest drivers of student learning, yet producing it at scale is labor-intensive. LLMs offer a natural path to scaling writing support, but two gaps stand in the way: few public corpora…

10
arXiv — NLP / Computation & Language research 2h ago

Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

arXiv:2607.00368v1 Announce Type: new Abstract: Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity,…

12
arXiv — NLP / Computation & Language research 2h ago

Auditing Forgetting in Limited Memory Language Models

arXiv:2607.00605v1 Announce Type: new Abstract: Limited Memory Language Models (LMLMs) externalize factual knowledge to a database to enable deletion-based unlearning without retraining. Existing evaluations measure post-deletion correctness in aggregate and cannot tell whether…

5
arXiv — NLP / Computation & Language research 2h ago

MetaHOPE: A Metaphor-Oriented Evaluation Framework for Analysing MT and LLM Translation Errors

arXiv:2607.00848v1 Announce Type: new Abstract: In this opinion paper, we propose MetaHOPE, an error severity-aware annotation framework for evaluating metaphor translations. Metaphors present challenges for machine translation (MT) and natural language understanding and…

21
arXiv — NLP / Computation & Language research 2h ago

Quantifying the Affective Gap: A Zero-Shot Evaluation of LLMs on Fine-Grained Emotion Taxonomies

arXiv:2607.00968v1 Announce Type: new Abstract: Emotion recognition in natural language is a foundational challenge in affective computing, with critical implications for human-computer interaction, mental health support, and conversational AI. This paper presents a rigorous,…

20
arXiv — NLP / Computation & Language research 2h ago

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration…

12
arXiv — NLP / Computation & Language research 2h ago

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded…

14
arXiv — NLP / Computation & Language research 2h ago

Measuring the Gap Between Human and LLM Research Ideas

arXiv:2607.01233v1 Announce Type: new Abstract: LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human…

8
arXiv — NLP / Computation & Language research 2h ago

Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages

arXiv:2607.01161v1 Announce Type: cross Abstract: Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. However, standard evaluation protocols confound language mismatch…

16
arXiv — NLP / Computation & Language research 2h ago

OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

arXiv:2510.24636v3 Announce Type: replace Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and…

34
TechCrunch — AI news-outlet 15h ago

Venice AI becomes a unicorn with $65M Series A as its privacy-first AI platform takes off

Venice AI is already profitable, with annualized run-rate revenues of over $70 million, CEO Erik Voorhees said.

17
Hugging Face Daily Papers research 19h ago

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

Abstract Procedural memory enhances LLM agents on workplace tasks through skill transfer across roles and models, with varying generalization capabilities affecting deployment strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Procedural memory is increasingly used to…

22
arXiv — NLP / Computation & Language research 1d ago

When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

arXiv:2606.30814v1 Announce Type: new Abstract: Calibration evaluates whether a model's confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration…

21
arXiv — NLP / Computation & Language research 1d ago

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

arXiv:2606.30887v1 Announce Type: new Abstract: Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. We introduce a framework that formulates…

32
arXiv — NLP / Computation & Language research 1d ago

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with…

7
arXiv — NLP / Computation & Language research 1d ago

Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law

arXiv:2606.31250v1 Announce Type: new Abstract: Large language models (LLMs) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standard:…

36
arXiv — NLP / Computation & Language research 1d ago

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations…

37
arXiv — NLP / Computation & Language research 1d ago

Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

arXiv:2606.31644v1 Announce Type: new Abstract: As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. We show that current fairness evaluations…

5
arXiv — NLP / Computation & Language research 1d ago

Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management

arXiv:2606.31692v1 Announce Type: new Abstract: This paper presents an overview of the second edition of the TalentCLEF challenge, organized as a Lab at the Conference and Labs of the Evaluation Forum (CLEF) 2026. TalentCLEF is an initiative aimed at advancing Natural Language…

19
arXiv — NLP / Computation & Language research 1d ago

Cross-lingual Relation Extraction with Large Language Models: Zero-Shot, Few-Shot, and Fine-Tuned Evaluation on Romanian

arXiv:2606.31718v1 Announce Type: new Abstract: Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large…

38
arXiv — NLP / Computation & Language research 1d ago

STEB: Style Text Embedding Benchmark

arXiv:2606.31741v1 Announce Type: new Abstract: While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap,…

27
arXiv — NLP / Computation & Language research 1d ago

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper…

25
arXiv — NLP / Computation & Language research 1d ago

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite…

29
arXiv — NLP / Computation & Language research 1d ago

SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA

arXiv:2504.07385v3 Announce Type: replace Abstract: As Large Language Models (LLMs) become increasingly used for question-answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. Meanwhile,…

26
arXiv — NLP / Computation & Language research 1d ago

FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

arXiv:2602.06625v2 Announce Type: replace Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and…

7
TechCrunch — AI news-outlet 1d ago

Wayve launches $85M employee tender offer at $8.5B valuation

Wayve’s offering is part of a growing trend of AI startups using employee tenders as a strategic tool to attract and retain talent.

31
Hugging Face Daily Papers research 1d ago

Dockerless: Environment-Free Program Verifier for Coding Agents

Abstract A Dockerless environment-free agentic patch verifier improves code patch evaluation accuracy and enables effective post-training without execution-based verification costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Program verifiers play a central role in training…

21
TechCrunch — AI news-outlet 1d ago

Nvidia competitor Etched hits $5B valuation, $1B in sales for AI chip

Nvidia AI chip competitor Etched says it has already booked $1 billion under contract for the inference systems powered by its chip.

19
Hugging Face Daily Papers research 1d ago

Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

Abstract ILLUME-X is a unified multimodal paradigm that enhances text-image generation through improved data efficiency, stable training processes, and comprehensive evaluation metrics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The advancement of generative AI models capable…

17
Hugging Face Daily Papers research 1d ago

Learning Transferable Dynamics Priors from Action to World Modeling

Abstract Action-conditioned world modeling enables transferable dynamics priors for robot learning through pretraining on large-scale manipulation data, supporting both simulator-based policy evaluation and video-action prediction. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

27
Hugging Face Daily Papers research 2d ago

Trimming the Long-Tail of Visual World Modeling Evaluation

Abstract Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Physical interactions follow a…

28
arXiv — Machine Learning research 2d ago

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived…

16
arXiv — Machine Learning research 2d ago

How Far Can Sharpness and Complexity Jointly Explain Generalization?

arXiv:2606.29043v1 Announce Type: new Abstract: Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the…

13
arXiv — Machine Learning research 2d ago

Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps

arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating…

32
arXiv — Machine Learning research 2d ago

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using…

27
arXiv — Machine Learning research 2d ago

Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation

arXiv:2606.29471v1 Announce Type: new Abstract: Strictly proper scoring rules identify the true conditional class distribution at population level, but their curvature can alter optimization and finite-sample behavior. We study three multiclass objectives: a class-aware…

23
arXiv — NLP / Computation & Language research 2d ago

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce…

28
arXiv — NLP / Computation & Language research 2d ago

Fine-Tuning General-Purpose Large Language Models for Agricultural Applications: A Reproducible Framework and Evaluation Protocol Based on Qwen3-8B

arXiv:2606.28992v1 Announce Type: new Abstract: General-purpose large language models (LLMs) have demonstrated strong abilities in open-domain question answering, information extraction, and text generation. Agricultural applications, however, are domain-specific,…

20
arXiv — NLP / Computation & Language research 2d ago

Understanding Evaluation Illusion in Diffusion Large Language Models

arXiv:2606.29228v1 Announce Type: new Abstract: Despite the capability of parallel decoding, diffusion large language models (dLLMs) require many denoising steps to maintain generation quality, motivating recent research on efficient decoding strategies. However, existing…

23
arXiv — NLP / Computation & Language research 2d ago

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against…

8
arXiv — NLP / Computation & Language research 2d ago

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical…

10
arXiv — NLP / Computation & Language research 2d ago

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it…

4
