Tag: Funding · 500 articles archived under #funding arXiv — Machine Learning research 2h ago TRIE: An Evaluation Framework for Stochastic PDE Surrogates arXiv:2607.00196v1 Announce Type: new Abstract: Many scientific systems exhibit uncertainty from stochastic forcing, unresolved degrees of freedom, or imperfect observations, making reliable surrogate forecasting fundamentally distributional rather than pointwise. For such… 37 arXiv — NLP / Computation & Language research 2h ago Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions arXiv:2607.00304v1 Announce Type: cross Abstract: The bias-reliability tradeoff conjectures that LLM evaluation systems are constrained in (gamma, H, CV) space, where evaluator coupling (gamma), strategy diversity (H), and small-sample measurement reliability (CV(N)) cannot be… 7 arXiv — Machine Learning research 2h ago Generative Refinement for Low-Budget Black-Box Optimization arXiv:2607.00691v1 Announce Type: new Abstract: Black-box optimization is a fundamental science and engineering tool that makes it possible to optimize objectives without gradient information. Unfortunately, as it often requires many function evaluations, it can be challenging… 31 arXiv — Machine Learning research 2h ago Detecting the Undetectable: Enhancing Unsupervised time series Anomaly Detection via Active Learning arXiv:2607.00720v1 Announce Type: new Abstract: Despite the increasing sophistication of industrial AI systems, the ability to reliably detect subtle and noisy anomalies in complex time series data remains a critical yet unresolved challenge. In large-scale industrial… 13 arXiv — Machine Learning research 2h ago Constrained Bayesian Optimisation with Multiple Information Sources arXiv:2607.00865v1 Announce Type: new Abstract: Bayesian Optimisation (BO) under unknown constraints is particularly challenging when feasible regions are small. 
In such settings, existing methods that typically rely solely on evaluations of the true objective and constraints… 32 arXiv — Machine Learning research 2h ago LeNEPA: No-Augmentation Next-Latent Prediction for Time-Series Representation Learning arXiv:2607.00958v1 Announce Type: new Abstract: Time series are central to modern data mining applications, from industrial telemetry and server metrics to finance and physiology, yet time-series self-supervised learning often depends on view and augmentation choices that encode… 14 arXiv — NLP / Computation & Language research 2h ago Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth arXiv:2607.00139v1 Announce Type: new Abstract: The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only… 20 arXiv — NLP / Computation & Language research 2h ago ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs arXiv:2607.00171v1 Announce Type: new Abstract: Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static, cover only a limited set of languages, are often domain-specific, susceptible to… 4 arXiv — NLP / Computation & Language research 2h ago SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework arXiv:2607.00274v1 Announce Type: new Abstract: Effective writing feedback is among the strongest drivers of student learning, yet producing it at scale is labor-intensive. 
LLMs offer a natural path to scaling writing support, but two gaps stand in the way: few public corpora… 10 arXiv — NLP / Computation & Language research 2h ago Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training arXiv:2607.00368v1 Announce Type: new Abstract: Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity,… 12 arXiv — NLP / Computation & Language research 2h ago Auditing Forgetting in Limited Memory Language Models arXiv:2607.00605v1 Announce Type: new Abstract: Limited Memory Language Models (LMLMs) externalize factual knowledge to a database to enable deletion-based unlearning without retraining. Existing evaluations measure post-deletion correctness in aggregate and cannot tell whether… 5 arXiv — NLP / Computation & Language research 2h ago MetaHOPE: A Metaphor-Oriented Evaluation Framework for Analysing MT and LLM Translation Errors arXiv:2607.00848v1 Announce Type: new Abstract: In this opinion paper, we propose MetaHOPE, an error severity-aware annotation framework for evaluating metaphor translations. Metaphors present challenges for machine translation (MT) and natural language understanding and… 21 arXiv — NLP / Computation & Language research 2h ago Quantifying the Affective Gap: A Zero-Shot Evaluation of LLMs on Fine-Grained Emotion Taxonomies arXiv:2607.00968v1 Announce Type: new Abstract: Emotion recognition in natural language is a foundational challenge in affective computing, with critical implications for human-computer interaction, mental health support, and conversational AI. 
This paper presents a rigorous,… 20 arXiv — NLP / Computation & Language research 2h ago Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration… 12 arXiv — NLP / Computation & Language research 2h ago Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded… 14 arXiv — NLP / Computation & Language research 2h ago Measuring the Gap Between Human and LLM Research Ideas arXiv:2607.01233v1 Announce Type: new Abstract: LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human… 8 arXiv — NLP / Computation & Language research 2h ago Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages arXiv:2607.01161v1 Announce Type: cross Abstract: Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. 
However, standard evaluation protocols confound language mismatch… 16 arXiv — NLP / Computation & Language research 2h ago OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning arXiv:2510.24636v3 Announce Type: replace Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and… 34 TechCrunch — AI news-outlet 15h ago Venice AI becomes a unicorn with $65M Series A as its privacy-first AI platform takes off Venice AI is already profitable, with annualized run-rate revenues of over $70 million, CEO Erik Voorhees said. 17 Hugging Face Daily Papers research 19h ago Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation Abstract Procedural memory enhances LLM agents on workplace tasks through skill transfer across roles and models, with varying generalization capabilities affecting deployment strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Procedural memory is increasingly used to… 22 arXiv — NLP / Computation & Language research 1d ago When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs arXiv:2606.30814v1 Announce Type: new Abstract: Calibration evaluates whether a model confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration… 21 arXiv — NLP / Computation & Language research 1d ago Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support arXiv:2606.30887v1 Announce Type: new Abstract: Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. 
We introduce a framework that formulates… 32 arXiv — NLP / Computation & Language research 1d ago Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with… 7 arXiv — NLP / Computation & Language research 1d ago Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law arXiv:2606.31250v1 Announce Type: new Abstract: Large language models (LLM) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standard:… 36 arXiv — NLP / Computation & Language research 1d ago CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations… 37 arXiv — NLP / Computation & Language research 1d ago Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues arXiv:2606.31644v1 Announce Type: new Abstract: As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. 
We show that current fairness evaluations… 5 arXiv — NLP / Computation & Language research 1d ago Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management arXiv:2606.31692v1 Announce Type: new Abstract: This paper presents an overview of the second edition of the TalentCLEF challenge, organized as a Lab at the Conference and Labs of the Evaluation Forum (CLEF) 2026. TalentCLEF is an initiative aimed at advancing Natural Language… 19 arXiv — NLP / Computation & Language research 1d ago Cross-lingual Relation Extraction with Large Language Models: Zero-Shot, Few-Shot, and Fine-Tuned Evaluation on Romanian arXiv:2606.31718v1 Announce Type: new Abstract: Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large… 38 arXiv — NLP / Computation & Language research 1d ago STEB: Style Text Embedding Benchmark arXiv:2606.31741v1 Announce Type: new Abstract: While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap,… 27 arXiv — NLP / Computation & Language research 1d ago Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. 
In this paper… 25 arXiv — NLP / Computation & Language research 1d ago HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite… 29 arXiv — NLP / Computation & Language research 1d ago SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA arXiv:2504.07385v3 Announce Type: replace Abstract: As Large Language Models (LLMs) become increasingly used for question-answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. Meanwhile,… 26 arXiv — NLP / Computation & Language research 1d ago FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge arXiv:2602.06625v2 Announce Type: replace Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and… 7 TechCrunch — AI news-outlet 1d ago Wayve launches $85M employee tender offer at $8.5B valuation Wayve’s offering is part of a growing trend of AI startups using employee tenders as a strategic tool to attract and retain talent. 31 Hugging Face Daily Papers research 1d ago Dockerless: Environment-Free Program Verifier for Coding Agents Abstract A Dockerless environment-free agentic patch verifier improves code patch evaluation accuracy and enables effective post-training without execution-based verification costs. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Program verifiers play a central role in training… 21 TechCrunch — AI news-outlet 1d ago Nvidia competitor Etched hits $5B valuation, $1B in sales for AI chip Nvidia AI chip competitor Etched says it has already booked $1 billion under contract for the inference systems powered by its chip. 19 Hugging Face Daily Papers research 1d ago Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation Abstract ILLUME-X is a unified multimodal paradigm that enhances text-image generation through improved data efficiency, stable training processes, and comprehensive evaluation metrics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The advancement of generative AI models capable… 17 Hugging Face Daily Papers research 1d ago Learning Transferable Dynamics Priors from Action to World Modeling Abstract Action-conditioned world modeling enables transferable dynamics priors for robot learning through pretraining on large-scale manipulation data, supporting both simulator-based policy evaluation and video-action prediction. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We… 27 Hugging Face Daily Papers research 2d ago Trimming the Long-Tail of Visual World Modeling Evaluation Abstract Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Physical interactions follow a… 28 arXiv — Machine Learning research 2d ago Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. 
The benchmark introduced here is derived… 16 arXiv — Machine Learning research 2d ago How Far Can Sharpness and Complexity Jointly Explain Generalization? arXiv:2606.29043v1 Announce Type: new Abstract: Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the… 13 arXiv — Machine Learning research 2d ago Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating… 32 arXiv — Machine Learning research 2d ago Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using… 27 arXiv — Machine Learning research 2d ago Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation arXiv:2606.29471v1 Announce Type: new Abstract: Strictly proper scoring rules identify the true conditional class distribution at population level, but their curvature can alter optimization and finite-sample behavior. 
We study three multiclass objectives: a class-aware… 23 arXiv — NLP / Computation & Language research 2d ago SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce… 28 arXiv — NLP / Computation & Language research 2d ago Fine-Tuning General-Purpose Large Language Models for Agricultural Applications: A Reproducible Framework and Evaluation Protocol Based on Qwen3-8B arXiv:2606.28992v1 Announce Type: new Abstract: General-purpose large language models (LLMs) have demonstrated strong abilities in open-domain question answering, information extraction, and text generation. Agricultural applications, however, are domain-specific,… 20 arXiv — NLP / Computation & Language research 2d ago Understanding Evaluation Illusion in Diffusion Large Language Models arXiv:2606.29228v1 Announce Type: new Abstract: Despite the capability of parallel decoding, diffusion large language models (dLLMs) require many denoising steps to maintain generation quality, motivating recent research on efficient decoding strategies. However, existing… 23 arXiv — NLP / Computation & Language research 2d ago Can MLLMs Critique Like Humans? 
Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against… 8 arXiv — NLP / Computation & Language research 2d ago Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical… 10 arXiv — NLP / Computation & Language research 2d ago MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it… 4