Tag

Funding

500 articles archived under #funding

arXiv — Machine Learning research 16d ago

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit…

11
arXiv — Machine Learning research 16d ago

Repeated Bilateral Trade: The Quest for Fairness

arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the…

34
arXiv — Machine Learning research 16d ago

PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation

arXiv:2606.15452v1 Announce Type: new Abstract: Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. We observe that rare events leave distinct topological fingerprints - transitions in…

17
arXiv — Machine Learning research 16d ago

Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

arXiv:2606.15887v1 Announce Type: new Abstract: Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR,…

4
arXiv — NLP / Computation & Language research 16d ago

Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

arXiv:2606.15026v1 Announce Type: new Abstract: Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal…

22
arXiv — NLP / Computation & Language research 16d ago

ReportQA: QA-Based Radiology Report Evaluation

arXiv:2606.15037v1 Announce Type: new Abstract: Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus…

38
arXiv — NLP / Computation & Language research 16d ago

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior…

7
arXiv — NLP / Computation & Language research 16d ago

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

arXiv:2606.15610v1 Announce Type: new Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or…

5
arXiv — NLP / Computation & Language research 16d ago

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation…

28
arXiv — NLP / Computation & Language research 16d ago

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning…

30
arXiv — NLP / Computation & Language research 16d ago

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

arXiv:2606.16000v1 Announce Type: new Abstract: We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be…

22
arXiv — NLP / Computation & Language research 16d ago

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

arXiv:2606.16026v1 Announce Type: new Abstract: We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across…

37
arXiv — NLP / Computation & Language research 16d ago

Evaluating LLM Personalization via Semantic Constraint Verification

arXiv:2606.16368v1 Announce Type: new Abstract: Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address…

38
OpenAI official-blog 16d ago

Predicting model behavior before release by simulating deployment

OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.

27
r/MachineLearning community 16d ago

Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D]

I'm trying to understand where people doing sensor-based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real-world data in the first…

6
The Information — AI news-outlet 16d ago

Nvidia Plans To Raise At Least $20 Billion In Bonds

Nvidia said Monday it plans to raise new debt even as the AI chip leader keeps generating tens of billions of dollars in cash every quarter. It will be the company’s first corporate bond sale since 2021, when it raised $5 billion. Bloomberg earlier reported that Nvidia would…

29
The Information — AI news-outlet 17d ago

Salesforce to Acquire Customer AI Agent Fin for $3.6 Billion

Salesforce has agreed to buy Fin, a startup that develops customer agents formerly known as Intercom, for $3.6 billion, as the software giant hopes to win new businesses from enterprises to adopt its own AI offering. The sale price is a big premium to Fin’s last valuation of $2…

18
The Information — AI news-outlet 17d ago

Exclusive: Nvidia Server Marketplace Startup Raises $100 Million at $800 Million Valuation

Data center software startup and AI-server broker Hydra Host has raised $100 million at a valuation of close to $800 million, led by Kindred Ventures. Nvidia, Cathie Wood’s ARK Invest, early CoreWeave backer Magnetar, and existing investors Founders Fund and Flume Ventures also…

26
arXiv — Machine Learning research 17d ago

A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series

arXiv:2606.13823v1 Announce Type: new Abstract: We study training-free fixed-length descriptors for multivariate time series and ask not merely whether such a descriptor performs well, but when it can be expected to work at all. Our object of study is $D(\tau)$, built from a…

15
arXiv — Machine Learning research 17d ago

DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation

arXiv:2606.14192v1 Announce Type: new Abstract: Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement…

9
arXiv — NLP / Computation & Language research 17d ago

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks…

29
arXiv — NLP / Computation & Language research 17d ago

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

arXiv:2606.13944v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt…

33
arXiv — NLP / Computation & Language research 17d ago

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models…

5
arXiv — NLP / Computation & Language research 17d ago

OdysSim: Building Foundation Models for Human Behavior Simulation

arXiv:2606.14199v1 Announce Type: new Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register,…

8
arXiv — NLP / Computation & Language research 17d ago

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

arXiv:2606.14278v1 Announce Type: new Abstract: Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it…

21
arXiv — NLP / Computation & Language research 17d ago

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

arXiv:2606.14516v1 Announce Type: cross Abstract: AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats,…

38
r/LocalLLaMA community 17d ago

Quality evaluation of quants with limited time or tokens

About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3…

36
r/MachineLearning community 18d ago

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,…

24
Hugging Face Daily Papers research 19d ago

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Abstract: Psychometric assessments of LLM behavior reveal that specific behavioral frameworks like Theory of Planned Behavior show better coherence with actual responses than broad personality traits, particularly within shared conversations. Generated by…

6
TechCrunch — AI news-outlet 19d ago

Mistral is rumored to be raising €3B at €20B valuation

The funding round would value the company at around €20 billion (about $23.15 billion), nearly double its Series C valuation of €11.7 billion.

23
Hugging Face official-blog 19d ago

olmo-eval: An evaluation workbench for the model development loop

Published June 12, 2026 · Tyler Murray, Kyle Wiggers (allenai) · Code: https://github.com/allenai/olmo-eval While you're building an LLM, you…

23
Hugging Face Daily Papers research 20d ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract: Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Adversarial robustness evaluations of large…

6
Hugging Face Daily Papers research 20d ago

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Abstract: WEAVER is a multi-view world model architecture that achieves high fidelity, consistency, and efficiency in robotic manipulation tasks through flow-matching loss and demonstrates superior performance in policy evaluation, improvement, and test-time planning. Generated…

27
arXiv — NLP / Computation & Language research 20d ago

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most…

23
arXiv — NLP / Computation & Language research 20d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to…

26
arXiv — NLP / Computation & Language research 20d ago

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to…

30
arXiv — NLP / Computation & Language research 20d ago

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

arXiv:2606.12730v1 Announce Type: cross Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in…

12
Hugging Face Daily Papers research 20d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract: EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Search…

26
TechCrunch — AI news-outlet 20d ago

Theker just raised $85M to build the factory robot that doesn’t specialize in anything

Unlike humanoid robots designed around a fixed form — think Boston Dynamics — Theker's machines are built to be reconfigured.

18
Hugging Face Daily Papers research 21d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Abstract: A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

16
arXiv — NLP / Computation & Language research 21d ago

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

arXiv:2606.11205v1 Announce Type: cross Abstract: Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance…

5
arXiv — Machine Learning research 21d ago

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

arXiv:2606.11235v1 Announce Type: new Abstract: A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the…

19
arXiv — Machine Learning research 21d ago

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

arXiv:2606.11268v1 Announce Type: new Abstract: Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data,…

20
arXiv — Machine Learning research 21d ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

arXiv:2606.11409v1 Announce Type: new Abstract: Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of…

5
arXiv — Machine Learning research 21d ago

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style…

29
arXiv — Machine Learning research 21d ago

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

arXiv:2606.12016v1 Announce Type: new Abstract: Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware,…

27
arXiv — Machine Learning research 21d ago

Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

arXiv:2606.12077v1 Announce Type: new Abstract: Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance…

15
arXiv — Machine Learning research 21d ago

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

arXiv:2606.12240v1 Announce Type: new Abstract: Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional…

26
arXiv — Machine Learning research 21d ago

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

arXiv:2606.12252v1 Announce Type: new Abstract: Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly…

8
arXiv — NLP / Computation & Language research 21d ago

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

arXiv:2606.11196v1 Announce Type: new Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without…

20
