Tag

Funding

500 articles archived under #funding

OpenAI official-blog 1mo ago

A shared playbook for trustworthy third party evaluations

OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems.

22
r/MachineLearning community 1mo ago

Social Simulation with LLMs - Fidelity in Applications (CFP @ COLM'26) [R]

🌟 Announcing the 2nd Workshop on Social Simulation with LLMs (Social Sim'26) @ COLM 📣 Welcoming Submissions! Submission here:. 🗓️ Deadline: June 23, 2026 (AoE). This year's theme is "Fidelity in Applications", moving beyond compelling demos toward evaluation, robustness,…

11
The Information — AI news-outlet 1mo ago

The AI Boom’s Pricey Middle

Baseten’s talks to raise fresh funding at an $11 billion valuation are the latest sign that investors are betting the messy work of helping developers run AI models can become one of the next big businesses in AI. That boom has lifted a group of companies including Baseten,…

27
The Information — AI news-outlet 1mo ago

Anthropic Releases New Flagship AI Model

Anthropic on Thursday announced its new flagship AI model, Claude Opus 4.8, which showed improvements in standardized AI performance evaluations in coding, financial analysis and other fields. The company also said the model is more likely to flag uncertainties about its work…

22
The Information — AI news-outlet 1mo ago

Anthropic Raises $65 Billion at $900 Billion Valuation; Micron, Samsung Invest

Anthropic said Thursday it had raised $65 billion at a valuation of $900 billion before the financing, more than double the valuation in a round closed three months earlier. New investors Micron, Samsung and SK Hynix, which make a key component of AI chips, are investing in the…

5
TechCrunch — AI news-outlet 1mo ago

Anthropic raises $65 Billion, nears $1T valuation ahead of IPO

Anthropic has closed a $65 billion Series H round at a $965 billion post-money valuation, marking what could be the AI startup's final private fundraise before a highly anticipated IPO.

14
Hacker News — AI on Front Page community 1mo ago

Anthropic raises $65B in Series H funding at $965B post-money valuation

Article URL: https://www.anthropic.com/news/series-h Comments URL: https://news.ycombinator.com/item?id=48313048 Points: 273 # Comments: 278

24
r/MachineLearning community 1mo ago

Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D]

Wall-OSS-0.5 is a new 4B VLA release from X Square Robot, built on a 3B VLM backbone with action experts in a Mixture-of-Transformers layout. What caught my eye is that the report evaluates the pretrained checkpoint on real robots before task-specific fine tuning, instead of…

25
r/LocalLLaMA community 1mo ago

Qwen/Qwen-Image-Bench · Hugging Face

Model Description: Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality criteria organized in a 3-level hierarchy…

8
Latent.Space news-outlet 1mo ago

[AINews] Cognition raises $1B in $26B Series D

coding is an uncapped TAM market

13
Smol AI News news-outlet 1mo ago

Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows

**Anthropic** announced a massive **$65B Series H financing** at a **$965B valuation**, led by **Altimeter, Dragoneer, Greenoaks, and Sequoia**, with run-rate revenue surpassing **$47B**. They launched **Claude Opus 4.8**, an update to Opus 4.7 featuring "sharper judgment,"…

28
arXiv — Machine Learning research 1mo ago

A Simple State Space Model Excels at Multivariate Time Series Classification

arXiv:2605.27406v1 Announce Type: new Abstract: Structured state space models (SSMs) have recently emerged as a promising foundation for sequence modeling, with Mamba-based architectures demonstrating strong performance through input-dependent state transitions, albeit at…

30
arXiv — Machine Learning research 1mo ago

Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation

arXiv:2605.27486v1 Announce Type: new Abstract: Federated learning (FL) has broadened the horizon for multivariate time series anomaly detection (MTSAD). However, benchmarking such anomaly detection methods within FL paradigm poses data-centric challenges. The existing datasets…

28
arXiv — Machine Learning research 1mo ago

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized…

17
arXiv — Machine Learning research 1mo ago

Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection

arXiv:2605.27992v1 Announce Type: new Abstract: Time series anomaly detection is critical for maintaining the reliability of mission-critical systems. While Transformer-based models like PatchTST have shown remarkable performance, their $\mathcal{O}(L^2)$ computational…

11
arXiv — Machine Learning research 1mo ago

Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector

arXiv:2605.28103v1 Announce Type: new Abstract: We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors -- spanning statistical, reconstruction, association, frequency, and…

4
arXiv — Machine Learning research 1mo ago

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

arXiv:2605.28203v1 Announce Type: new Abstract: As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional…

14
arXiv — NLP / Computation & Language research 1mo ago

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

arXiv:2605.27393v1 Announce Type: new Abstract: Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce…

7
arXiv — NLP / Computation & Language research 1mo ago

Disentangling Language Roles in Multilingual LLM Task Execution

arXiv:2605.27649v1 Announce Type: new Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate…

28
arXiv — NLP / Computation & Language research 1mo ago

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

arXiv:2605.27709v1 Announce Type: new Abstract: Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning…

37
arXiv — NLP / Computation & Language research 1mo ago

ChildEval: When large language models meet children's personalities

arXiv:2605.27805v1 Announce Type: new Abstract: While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a…

20
arXiv — NLP / Computation & Language research 1mo ago

GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

arXiv:2605.27866v1 Announce Type: new Abstract: Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for…

35
arXiv — NLP / Computation & Language research 1mo ago

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

arXiv:2605.27882v1 Announce Type: new Abstract: LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified…

38
arXiv — NLP / Computation & Language research 1mo ago

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

arXiv:2605.27914v1 Announce Type: new Abstract: Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge…

7
arXiv — NLP / Computation & Language research 1mo ago

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment…

10
arXiv — NLP / Computation & Language research 1mo ago

Auditing Stance Asymmetry in Generative Explanations

arXiv:2605.27988v1 Announce Type: new Abstract: Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations…

22
arXiv — NLP / Computation & Language research 1mo ago

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations:…

18
arXiv — NLP / Computation & Language research 1mo ago

The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

arXiv:2605.28020v1 Announce Type: new Abstract: With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token…

29
arXiv — NLP / Computation & Language research 1mo ago

ATLAS: All-round Testing of Long-context Abilities across Scales

arXiv:2605.28079v1 Announce Type: new Abstract: Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and…

4
arXiv — NLP / Computation & Language research 1mo ago

Chinese Word Boundary Recovery through Character Alignment Projection

arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper…

30
arXiv — NLP / Computation & Language research 1mo ago

Why We Need Speech to Evaluate Speech Translation

arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and…

35
arXiv — NLP / Computation & Language research 1mo ago

Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

arXiv:2605.28313v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs…

38
r/MachineLearning community 1mo ago

BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R]

I’m looking for feedback on a local agent-memory benchmark comparison, especially from people who care about evaluation methodology. I built an open-source R&D memory system called Context Swarm Memory…

31
The Information — AI news-outlet 1mo ago

Coding Startup Cognition Raises $1 Billion at a $26 Billion Valuation

Coding startup Cognition has raised more than $1 billion in a funding round that valued the company at $26 billion including the investment, the company said in a blog post. That’s nearly double its valuation from its last fundraise, which valued the three-year-old company at…

11
Hugging Face Daily Papers research 1mo ago

FastKernels: Benchmarking GPU Kernel Generation in Production

Abstract: FastKernels addresses the gap between benchmark evaluation and production performance for LLM kernel agents by providing a representative set of architectures and a production-grade inference framework that aligns evaluation with real-world deployment. AI-generated…

34
TechCrunch — AI news-outlet 1mo ago

AI coding startup Cognition raises $1B at $25B pre-money valuation

As Cognition reaches $492 million in annualized revenue run rate, it more than doubled its valuation in eight months, it says.

15
Hugging Face Daily Papers research 1mo ago

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Abstract: A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency. AI-generated summary: Social…

30
Hugging Face Daily Papers research 1mo ago

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Abstract: A skill-centric agent framework enables continuous improvement of task-solving capabilities through a unified lifecycle of skill creation, memory, management, evaluation, and refinement. AI-generated summary: Large language model (LLM) agents rely on reusable skills to…

21
Hugging Face Daily Papers research 1mo ago

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Abstract: Agentic CLEAR is an automatic evaluation framework that provides multi-level textual insights into agent behavior through dynamic analysis of LLM interactions across various benchmarks and settings. AI-generated summary: Agentic systems are becoming more capable: agents…

19
The Information — AI news-outlet 1mo ago

Inference Provider Baseten in Talks to Double Valuation to $11 Billion

Baseten, a startup that rents out Nvidia AI servers to application developers and helps them customize models, has recently been in talks with investors to raise $1 billion at an $11 billion valuation including the money, The Information reported Tuesday. That would more than…

37
arXiv — Machine Learning research 1mo ago

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

arXiv:2605.26161v1 Announce Type: new Abstract: Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing…

36
arXiv — Machine Learning research 1mo ago

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

arXiv:2605.26191v1 Announce Type: new Abstract: This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or…

22
arXiv — Machine Learning research 1mo ago

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

arXiv:2605.26193v1 Announce Type: new Abstract: Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure…

38
arXiv — Machine Learning research 1mo ago

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

arXiv:2605.26194v1 Announce Type: new Abstract: Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting).…

30
arXiv — Machine Learning research 1mo ago

Function-Valued Causal Influence in Nonlinear Time Series

arXiv:2605.26408v1 Announce Type: new Abstract: Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the…

34
arXiv — Machine Learning research 1mo ago

Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series

arXiv:2605.26569v1 Announce Type: new Abstract: We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to…

14
arXiv — Machine Learning research 1mo ago

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

arXiv:2605.26690v1 Announce Type: new Abstract: Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often…

9
arXiv — Machine Learning research 1mo ago

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

arXiv:2605.26704v1 Announce Type: new Abstract: Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models…

22
arXiv — Machine Learning research 1mo ago

Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining

arXiv:2605.26759v1 Announce Type: new Abstract: Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer…

33
arXiv — Machine Learning research 1mo ago

Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability

arXiv:2605.26790v1 Announce Type: new Abstract: Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions. In this work, we show these quantities can be accurately…

23
