Tag

Funding

500 articles archived under #funding

The Information — AI news-outlet 1mo ago

AI Evaluators Struggle with Models That Know When They’re Being Tested

AI researchers are starting to make progress on a confounding problem: AI models are getting better at telling when they are in an evaluation. That could become a problem for AI companies that use evaluations to gauge the capabilities and behaviors of their models before…

37
arXiv — Machine Learning research 1mo ago

Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling

arXiv:2605.30376v1 Announce Type: new Abstract: Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but…

15
arXiv — Machine Learning research 1mo ago

A Novel Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection: MADQI

arXiv:2605.30388v1 Announce Type: new Abstract: This paper introduces a new systematic framework for detecting anomalies in maritime Automatic Identification System (AIS) datasets. These anomalies include abnormal vessel behaviours related to speed, position jumps, time gaps,…

22
arXiv — Machine Learning research 1mo ago

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

arXiv:2605.30393v1 Announce Type: new Abstract: Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. We introduce NumLeak, a measurement framework that combines API-boundary…

25
arXiv — Machine Learning research 1mo ago

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

arXiv:2605.30514v1 Announce Type: new Abstract: Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This…

10
arXiv — Machine Learning research 1mo ago

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

arXiv:2605.30590v1 Announce Type: new Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other…

23
arXiv — Machine Learning research 1mo ago

Conformal Reliability: A New Evaluation Metric for Conditional Generation

arXiv:2605.30807v1 Announce Type: new Abstract: Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is…

8
arXiv — Machine Learning research 1mo ago

GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring

arXiv:2605.30865v1 Announce Type: new Abstract: Continuous glucose monitoring (CGM) provides a dense view of daily metabolic physiology, yet existing generic time-series and CGM-specific foundation models often encode glucose traces as entangled single-stream sequences, leaving…

31
arXiv — NLP / Computation & Language research 1mo ago

Refining Word-Based Grammatical Error Annotation for L2 Korean

arXiv:2605.30545v1 Announce Type: new Abstract: Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they…

10
arXiv — NLP / Computation & Language research 1mo ago

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

arXiv:2605.30568v1 Announce Type: new Abstract: LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained…

37
arXiv — NLP / Computation & Language research 1mo ago

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal…

26
arXiv — NLP / Computation & Language research 1mo ago

Pairwise Reference Alignment as a Model-Level Ordinal Observable

arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference…

18
arXiv — NLP / Computation & Language research 1mo ago

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

arXiv:2605.31351v1 Announce Type: new Abstract: AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general…

30
arXiv — NLP / Computation & Language research 1mo ago

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues…

36
arXiv — NLP / Computation & Language research 1mo ago

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

arXiv:2605.31483v1 Announce Type: new Abstract: Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination…

20
Hugging Face Daily Papers research 1mo ago

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Abstract Swanbench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations. AI-generated summary Recent advances in speech generation have enabled…

5
Hugging Face Daily Papers research 1mo ago

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Abstract OpenSkillEval is an automatic evaluation framework that assesses skill-augmented agent systems and skills across diverse real-world applications, revealing that skill availability doesn't guarantee effective usage and that performance benefits depend heavily on model…

31
r/LocalLLaMA community 1mo ago

PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark)

Author here. The short version of why I built this: Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in real-world environments such as defensive…

6
r/MachineLearning community 1mo ago

Bayesian Opt. GPs vs Linear models and Neural Networks for parameter optimizations [R]

Hi, Relatively new to deep learning. I wanted some opinions on which of these approaches might be best for time series data and spectral analysis. I currently use a GP and it works pretty well, but I’m wondering what the computational tradeoffs and so forth might be. Any ideas?…

4
Hacker News — AI on Front Page community 1mo ago

OpenRouter raises $113M Series B

Article URL: https://openrouter.ai/announcements/series-b Comments URL: https://news.ycombinator.com/item?id=48338660 Points: 242 # Comments: 110

4
TechCrunch — AI news-outlet 1mo ago

The groupthink boom: what 3 top VCs really think about the AI frenzy

"If you're 22 years old in San Francisco and building something in AI, there may be a seed term sheet in your inbox — but if you're 19, oh my God, this means you're really good; you might already have a Series A [offer]," said one, half-kiddingly.

12
r/LocalLLaMA community 1mo ago

Gryphe/Pantheon-Reasoning-27B · Hugging Face

from Gryphe: An experiment in bringing reasoning capability to the Pantheon roleplay series in the form of an uncensored dense Qwen 3.6 27B. This specific model can be thought of as a successor to both the Pantheon series and the one-time Codex release since I used such a large…

15
Hacker News — AI on Front Page community 1mo ago

Danish pension fund excludes SpaceX citing governance and valuation

Article URL: https://www.reuters.com/legal/transactional/danish-pension-fund-excludes-spacex-citing-governance-valuation-2026-05-29/ Comments URL: https://news.ycombinator.com/item?id=48333820 Points: 207 # Comments: 146

23
Hugging Face Daily Papers research 1mo ago

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Abstract A parameter-efficient vision-language model is developed for time-series anomaly detection using a novel benchmark with natural-language rationales, achieving superior performance and generalization across multiple datasets. AI-generated summary Recent advances in…

38
TechCrunch — AI news-outlet 1mo ago

This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory

South Korean chip startup Xcena is betting that AI's real bottleneck is not compute, but memory.

20
Hugging Face Daily Papers research 1mo ago

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

Abstract PRISM evaluates automated peer review systems across multiple dimensions using argument mining and retrieval-augmented verification, revealing that while LLMs match human performance in specific areas, no system consistently equals human reviewers across all evaluation…

19
arXiv — Machine Learning research 1mo ago

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

arXiv:2605.28866v1 Announce Type: new Abstract: Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. However, prior studies largely overlook the inherent continuity and ordinality of time series…

20
arXiv — Machine Learning research 1mo ago

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

arXiv:2605.28867v1 Announce Type: new Abstract: Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an…

10
arXiv — Machine Learning research 1mo ago

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

arXiv:2605.29005v1 Announce Type: new Abstract: Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational…

18
arXiv — Machine Learning research 1mo ago

Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation

arXiv:2605.29108v1 Announce Type: new Abstract: Selecting efficient multi-step synthetic routes is a central challenge in organic synthesis, particularly in medicinal and process chemistry, where route choice directly impacts feasibility, cost, and development efficiency.…

28
arXiv — Machine Learning research 1mo ago

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

arXiv:2605.29156v1 Announce Type: new Abstract: Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit…

9
arXiv — Machine Learning research 1mo ago

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

arXiv:2605.29283v1 Announce Type: new Abstract: Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to…

22
arXiv — Machine Learning research 1mo ago

Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems

arXiv:2605.29373v1 Announce Type: new Abstract: Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information. To address these issues, we…

13
arXiv — Machine Learning research 1mo ago

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

arXiv:2605.29500v1 Announce Type: new Abstract: Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard…

11
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated…

19
arXiv — NLP / Computation & Language research 1mo ago

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how…

31
arXiv — NLP / Computation & Language research 1mo ago

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

arXiv:2605.28882v1 Announce Type: new Abstract: With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet…

4
arXiv — NLP / Computation & Language research 1mo ago

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

arXiv:2605.29256v1 Announce Type: new Abstract: Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and…

21
arXiv — NLP / Computation & Language research 1mo ago

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods…

19
arXiv — NLP / Computation & Language research 1mo ago

Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

arXiv:2605.29397v1 Announce Type: new Abstract: HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is…

35
arXiv — NLP / Computation & Language research 1mo ago

Comparative Evaluation of Machine Translation Systems on Images with Text

arXiv:2605.29476v1 Announce Type: new Abstract: This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study…

7
arXiv — NLP / Computation & Language research 1mo ago

PhoneWorld: Scaling Phone-Use Agent Environments

arXiv:2605.29486v1 Announce Type: new Abstract: A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but…

28
arXiv — NLP / Computation & Language research 1mo ago

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a…

31
arXiv — NLP / Computation & Language research 1mo ago

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

arXiv:2605.29585v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the…

27
arXiv — NLP / Computation & Language research 1mo ago

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries,…

9
arXiv — NLP / Computation & Language research 1mo ago

Personalized Turn-Level User Conversation Satisfaction Benchmark

arXiv:2605.29711v1 Announce Type: new Abstract: User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation…

9
arXiv — NLP / Computation & Language research 1mo ago

Metric-Dependent Annotation Saturation for Learning from Label Distributions

arXiv:2605.29797v1 Announce Type: new Abstract: When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from…

37
arXiv — NLP / Computation & Language research 1mo ago

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

arXiv:2605.29800v1 Announce Type: new Abstract: LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how…

35
r/LocalLLaMA community 1mo ago

llama.cpp B9387 Significant AMD/ROCm PP Update

https://github.com/ggml-org/llama.cpp/releases/tag/b9387 MFMA is restricted to the AMD CDNA architecture, that is, the MI100, MI200, and MI300 series datacenter cards. Post your initial results if you try it! Submitted by /u/Bulky-Priority6824

38
The Information — AI news-outlet 1mo ago

Base Power in Talks to Raise Funds at $12 Billion Valuation

Base Power, a three-year-old home-battery startup, is in talks to raise funds at a $12 billion valuation, according to a person with knowledge of the discussions. Ribbit Capital, which backed Base Power’s last funding round, has been in talks to lead the current round, according…

17
