Funding · 500 articles archived under #funding
arXiv — NLP / Computation & Language research 7d ago Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM… 18 arXiv — NLP / Computation & Language research 7d ago Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts arXiv:2606.25935v1 Announce Type: new Abstract: Was this person ever at that place, and if so, when? Answering such questions from noisy, multilingual historical documents is the central challenge of HIPE-2026, the third edition of the HIPE evaluation series. Moving from named… 14 arXiv — NLP / Computation & Language research 7d ago SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations… 29 arXiv — NLP / Computation & Language research 7d ago RAS: Measuring LLM Safety Through Refusal Alignment arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. 
Although useful, output-level evaluation is… 27 arXiv — NLP / Computation & Language research 7d ago Autodata: An agentic data scientist to create high quality synthetic data arXiv:2606.25996v1 Announce Type: cross Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to… 30 arXiv — NLP / Computation & Language research 7d ago Robustness assessment of large audio language models in multiple-choice evaluation arXiv:2510.04584v2 Announce Type: replace Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in… 13 arXiv — NLP / Computation & Language research 7d ago Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge arXiv:2602.02219v2 Announce Type: replace Abstract: Large language models are widely employed as evaluators, a paradigm commonly referred to as LLM-as-a-judge. Prior research has predominantly examined point-wise or pair-wise evaluation protocols; in contrast, our focus is on… 8 Hugging Face Daily Papers research 7d ago Are We Ready For An Agent-Native Memory System? Abstract Large language model agents' memory systems have evolved into complex data management frameworks requiring systematic evaluation across multiple modules and workloads to understand their performance characteristics and trade-offs. 
Generated by… 7 Hugging Face Daily Papers research 8d ago DiffusionBench: On Holistic Evaluation of Diffusion Transformers Abstract Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers that demonstrates the need for comprehensive benchmarking beyond ImageNet class-conditional generation to assess true progress in generative modeling. Generated by… 25 arXiv — Machine Learning research 8d ago Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery arXiv:2606.23757v1 Announce Type: new Abstract: Extracting interpretable governing equations from sparse, noisy chemical time-series data remains difficult because discrete reaction topology and continuous kinetic parameters are tightly coupled. We present PC-MCMC-CIGP, a… 33 arXiv — Machine Learning research 8d ago One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline arXiv:2606.23767v1 Announce Type: new Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol -- different pair subsets, weightings, model-selection, and decision rates.… 34 arXiv — Machine Learning research 8d ago Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data arXiv:2606.23871v1 Announce Type: new Abstract: Survival analysis is central to clinical decision-making, yet reliable time-to-event models require large, diverse cohorts that are rarely available at a single institution, while privacy regulations restrict the centralization of… 28 arXiv — Machine Learning research 8d ago GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series arXiv:2606.23880v1 Announce Type: new Abstract: From climate teleconnections to gene regulation, modern time-series datasets encompass tens 
or hundreds of interacting variables, making causal discovery increasingly challenging. Constraint-based methods offer statistical rigor… 30 arXiv — Machine Learning research 8d ago You Don't Need to Run Every Eval arXiv:2606.24020v1 Announce Type: new Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to… 29 arXiv — Machine Learning research 8d ago Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation arXiv:2606.24340v1 Announce Type: new Abstract: In recent years, the Internet of Things (IoT) paradigm has been shifting toward batteryless, energy-harvesting architectures. Sustaining reliable operation in these systems requires intelligent management of highly volatile stored… 30 arXiv — Machine Learning research 8d ago A Fair Evaluation of Graph Foundation Models for Node Property Prediction arXiv:2606.24509v1 Announce Type: new Abstract: Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called… 33 arXiv — NLP / Computation & Language research 8d ago Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs arXiv:2606.23915v1 Announce Type: new Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. 
We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside… 28 arXiv — Machine Learning research 8d ago Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web arXiv:2606.24236v1 Announce Type: cross Abstract: Visual assessment of residual plots is a common approach for diagnosing linear models, but it relies on manual evaluation, which does not scale well and can lead to inconsistent decisions across analysts. The lineup protocol,… 16 arXiv — Machine Learning research 8d ago PROTECT-90: A Fault Dataset for Power System Protection arXiv:2606.24298v1 Announce Type: cross Abstract: The increasing interest in data-driven methods for power system protection is accompanied by a lack of standardized, publicly available high-voltage waveform datasets that enable transparent and reproducible evaluation. To… 36 arXiv — Machine Learning research 8d ago EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate… 19 arXiv — NLP / Computation & Language research 8d ago Quantifying Prior Dominance in RAG Systems arXiv:2606.23695v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from ''epistemic blindness'' - failing to distinguish genuine contextual… 28 arXiv — NLP / Computation & Language research 8d ago QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages arXiv:2606.23943v1 Announce Type: new Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation 
metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark… 32 arXiv — NLP / Computation & Language research 8d ago MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,… 38 arXiv — NLP / Computation & Language research 8d ago Aspect-Based Sentiment Evolution and its Correlation with Review Rounds in Multi-Round Peer Reviews: A Deep Learning Approach arXiv:2606.24188v1 Announce Type: new Abstract: Mining sentiment information from the textual content of peer review comments offers valuable insights into the scientific evaluation process. However, previous studies are often constrained by coarse-grained analysis and the lack… 19 arXiv — NLP / Computation & Language research 8d ago SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization arXiv:2606.24259v1 Announce Type: new Abstract: Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical… 4 arXiv — NLP / Computation & Language research 8d ago On the Stability of Prompt Ranking in Large Language Model Evaluation arXiv:2606.24381v1 Announce Type: new Abstract: Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. 
This workflow implicitly assumes… 34 arXiv — NLP / Computation & Language research 8d ago Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models arXiv:2606.24610v1 Announce Type: new Abstract: The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range… 10 arXiv — NLP / Computation & Language research 8d ago AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline… 25 arXiv — NLP / Computation & Language research 8d ago ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained… 15 arXiv — NLP / Computation & Language research 8d ago The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs arXiv:2504.17768v3 Announce Type: replace Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with… 29 Hugging Face Daily Papers research 8d ago Libretto: Giving LLM Agents a Sense of Musical Structure Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from… 18 OpenAI official-blog 8d ago Helping build shared standards for advanced AI OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation. 31 Hugging Face Daily Papers research 9d ago Counsel: A Meta-Evaluation Dataset for Agentic Tasks Abstract A large-scale dataset of human-metaevaluations of LLM critiques for agentic tasks is introduced to improve the calibration and reliability of automated evaluation methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As agentic systems tackle increasingly complex… 22 r/LocalLLaMA community 9d ago Human Evaluation of GLM-5.2 I've seen plenty of benchmarks that put GLM-5.2 below many of the closed source alternatives but at their heels. I thought to myself, next version GLM will totally be where the best frontiers are at now. The last few days I've been testing it on a real world project, and it's… 6 Hugging Face Daily Papers research 9d ago EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents… 30 r/LocalLLaMA community 9d ago Boogu Base, Turbo, Edit - open-source unified image generation and editing model series Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing,… 22 Hugging Face Daily Papers research 9d ago DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to… 14 r/LocalLLaMA community 9d ago DeepSeek raises $7.4B USD at $60B valuation. Remarkably, Liang Wenfeng invests $3B in DeepSeek himself. Submitted by /u/FullOf_Bad_Ideas 35 r/MachineLearning community 11d ago TSAuditor: A time-series auditing framework [P] This happened a few months ago when I was working on an analysis project that dealt with time-series data. The dataset was large (10 years of data). I was using a standard profiling tool to check the pipeline. Everything looked fine because the tool reported 3% missing data rate… 29 Hugging Face Daily Papers research 12d ago The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation Abstract Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and… 21 r/MachineLearning community 12d ago Best library for releasing my research optimization algorithm? [D] Hi All! 
I have developed a research optimizer (QQN Quadratic Quasi-Newton) and published a paper on it where I am able to, but I would really like to make the algorithm itself easily available to the community for evaluation. I have a Rust, Java, and Javascript implementations,… 36 TechCrunch — AI news-outlet 12d ago The CEO of Allbirds’ new AI biz has a plan, but no employees Call it a startup with a sole founder and a very large seed round, but what's next is less clear. 23 Hugging Face Daily Papers research 13d ago Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages Abstract Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 33 arXiv — Machine Learning research 13d ago Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures arXiv:2606.19365v1 Announce Type: new Abstract: Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and a highly… 35 arXiv — Machine Learning research 13d ago MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery arXiv:2606.19624v1 Announce Type: new Abstract: Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. 
Subtle issues in experimental design and model evaluation procedures can degrade the… 16 arXiv — Machine Learning research 13d ago SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models arXiv:2606.19888v1 Announce Type: new Abstract: Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent… 11 arXiv — Machine Learning research 13d ago PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection arXiv:2606.20055v1 Announce Type: new Abstract: Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational… 21 arXiv — Machine Learning research 13d ago Learner-based Concept Drift Detection: Analysis and Evaluation arXiv:2606.20216v1 Announce Type: new Abstract: Machine learning algorithms deployed for evolving streaming environments must handle the non-stationary data distributions, commonly referred to as concept drift. 
The presence of concept drift poses a major challenge for many… 23 arXiv — NLP / Computation & Language research 13d ago Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias arXiv:2606.19544v1 Announce Type: new Abstract: LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates… 34 arXiv — NLP / Computation & Language research 13d ago IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources arXiv:2606.20089v1 Announce Type: new Abstract: Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a… 15