Tag: Funding · 500 articles archived under #funding
arXiv — Machine Learning research 16d ago Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit… 11 arXiv — Machine Learning research 16d ago Repeated Bilateral Trade: The Quest for Fairness arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the… 34 arXiv — Machine Learning research 16d ago PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation arXiv:2606.15452v1 Announce Type: new Abstract: Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. We observe that rare events leave distinct topological fingerprints - transitions in… 17 arXiv — Machine Learning research 16d ago Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes arXiv:2606.15887v1 Announce Type: new Abstract: Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. 
We validate AIPR,… 4 arXiv — NLP / Computation & Language research 16d ago Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals arXiv:2606.15026v1 Announce Type: new Abstract: Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal… 22 arXiv — NLP / Computation & Language research 16d ago ReportQA: QA-Based Radiology Report Evaluation arXiv:2606.15037v1 Announce Type: new Abstract: Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus… 38 arXiv — NLP / Computation & Language research 16d ago A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior… 7 arXiv — NLP / Computation & Language research 16d ago LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation arXiv:2606.15610v1 Announce Type: new Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. 
Yet these judges are often reported as scalar accuracy, win-rate, or… 5 arXiv — NLP / Computation & Language research 16d ago Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation… 28 arXiv — NLP / Computation & Language research 16d ago A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning… 30 arXiv — NLP / Computation & Language research 16d ago GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science arXiv:2606.16000v1 Announce Type: new Abstract: We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. 
GRACE-DS is a set of evaluation metrics in an isolated environment that can be… 22 arXiv — NLP / Computation & Language research 16d ago In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation arXiv:2606.16026v1 Announce Type: new Abstract: We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across… 37 arXiv — NLP / Computation & Language research 16d ago Evaluating LLM Personalization via Semantic Constraint Verification arXiv:2606.16368v1 Announce Type: new Abstract: Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address… 38 OpenAI official-blog 16d ago Predicting model behavior before release by simulating deployment OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy. 27 r/MachineLearning community 16d ago Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D] I'm trying to understand where people doing sensor based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real world data in the first… 6 The Information — AI news-outlet 16d ago Nvidia Plans To Raise At Least $20 Billion In Bonds Nvidia said Monday it plans to raise new debt even as the AI chip leader keeps generating tens of billions of dollars in cash every quarter. 
It will be the company’s first corporate bond sale since 2021, when it raised $5 billion. Bloomberg earlier reported that Nvidia would… 29 The Information — AI news-outlet 17d ago Salesforce to Acquire Customer AI Agent Fin for $3.6 Billion Salesforce has agreed to buy Fin, a startup that develops customer agents formerly known as Intercom, for $3.6 billion, as the software giant hopes to win new businesses from enterprises to adopt its own AI offering. The sale price is a big premium to Fin’s last valuation of $2… 18 The Information — AI news-outlet 17d ago Exclusive: Nvidia Server Marketplace Startup Raises $100 Million at $800 Million Valuation Data center software startup and AI-server broker Hydra Host has raised $100 million at a valuation of close to $800 million, led by Kindred Ventures. Nvidia, Cathie Wood’s ARK Invest, early CoreWeave backer Magnetar, and existing investors Founders Fund and Flume Ventures also… 26 arXiv — Machine Learning research 17d ago A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series arXiv:2606.13823v1 Announce Type: new Abstract: We study training-free fixed-length descriptors for multivariate time series and ask not merely whether such a descriptor performs well, but when it can be expected to work at all. Our object of study is $D(\tau)$, built from a… 15 arXiv — Machine Learning research 17d ago DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation arXiv:2606.14192v1 Announce Type: new Abstract: Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement… 9 arXiv — NLP / Computation & Language research 17d ago The Coin Flip Judge? 
Reliability and Bias in LLM-as-a-Judge Evaluation arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks… 29 arXiv — NLP / Computation & Language research 17d ago LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values arXiv:2606.13944v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt… 33 arXiv — NLP / Computation & Language research 17d ago Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models… 5 arXiv — NLP / Computation & Language research 17d ago OdysSim: Building Foundation Models for Human Behavior Simulation arXiv:2606.14199v1 Announce Type: new Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register,… 8 arXiv — NLP / Computation & Language research 17d ago Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge arXiv:2606.14278v1 Announce Type: new Abstract: Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. 
This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it… 21 arXiv — NLP / Computation & Language research 17d ago Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results arXiv:2606.14516v1 Announce Type: cross Abstract: AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats,… 38 r/LocalLLaMA community 17d ago Quality evaluation of quants with limited time or tokens About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3… 36 r/MachineLearning community 18d ago The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R] We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,… 24 Hugging Face Daily Papers research 19d ago Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior Abstract Psychometric assessments of LLM behavior reveal that specific behavioral frameworks like Theory of Planned Behavior show better coherence with actual responses than broad personality traits, particularly within shared conversations. Generated by… 6 TechCrunch — AI news-outlet 19d ago Mistral is rumored to be raising €3B at €20B valuation The funding round would value the company at around €20 billion (about $23.15 billion), nearly double its Series C valuation of €11.7 billion. 
23 Hugging Face official-blog 19d ago olmo-eval: An evaluation workbench for the model development loop Published June 12, 2026 · Tyler Murray (allenai), Kyle Wiggers (Ai2Comms, allenai) · 💻 Code: https://github.com/allenai/olmo-eval While you're building an LLM, you… 23 Hugging Face Daily Papers research 20d ago Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models Abstract Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Adversarial robustness evaluations of large… 6 Hugging Face Daily Papers research 20d ago WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation Abstract WEAVER is a multi-view world model architecture that achieves high fidelity, consistency, and efficiency in robotic manipulation tasks through flow-matching loss and demonstrates superior performance in policy evaluation, improvement, and test-time planning. Generated… 27 arXiv — NLP / Computation & Language research 20d ago LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most… 23 arXiv — NLP / Computation & Language research 20d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. 
Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to… 26 arXiv — NLP / Computation & Language research 20d ago EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to… 30 arXiv — NLP / Computation & Language research 20d ago Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior arXiv:2606.12730v1 Announce Type: cross Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in… 12 Hugging Face Daily Papers research 20d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search… 26 TechCrunch — AI news-outlet 20d ago Theker just raised $85M to build the factory robot that doesn’t specialize in anything Unlike humanoid robots designed around a fixed form — think Boston Dynamics — Theker's machines are built to be reconfigured. 18 Hugging Face Daily Papers research 21d ago Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 16 arXiv — NLP / Computation & Language research 21d ago Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention arXiv:2606.11205v1 Announce Type: cross Abstract: Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance… 5 arXiv — Machine Learning research 21d ago Few-Shot Resampling for Scalable Statistically-Sound Data Mining arXiv:2606.11235v1 Announce Type: new Abstract: A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the… 19 arXiv — Machine Learning research 21d ago LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data arXiv:2606.11268v1 Announce Type: new Abstract: Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data,… 20 arXiv — Machine Learning research 21d ago Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models arXiv:2606.11409v1 Announce Type: new Abstract: Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. 
In practice, the computational expense of… 5 arXiv — Machine Learning research 21d ago Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style… 29 arXiv — Machine Learning research 21d ago Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization arXiv:2606.12016v1 Announce Type: new Abstract: Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware,… 27 arXiv — Machine Learning research 21d ago Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization arXiv:2606.12077v1 Announce Type: new Abstract: Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance… 15 arXiv — Machine Learning research 21d ago Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training arXiv:2606.12240v1 Announce Type: new Abstract: Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. 
Traditional… 26 arXiv — Machine Learning research 21d ago Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification arXiv:2606.12252v1 Announce Type: new Abstract: Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly… 8 arXiv — NLP / Computation & Language research 21d ago PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference arXiv:2606.11196v1 Announce Type: new Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without… 20