Tag: Funding · 500 articles archived under #funding

OpenAI official-blog 1mo ago A shared playbook for trustworthy third party evaluations OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems. 22

r/MachineLearning community 1mo ago Social Simulation with LLMs - Fidelity in Applications (CFP @ COLM'26) [R] 🌟 Announcing the 2nd Workshop on Social Simulation with LLMs (Social Sim'26) @ COLM 📣 Welcoming Submissions! Submission here:. 🗓️ Deadline: June 23, 2026 (AoE) This year's theme is "Fidelity in Applications", moving beyond compelling demos toward evaluation, robustness,… 11

The Information — AI news-outlet 1mo ago The AI Boom’s Pricey Middle Baseten’s talks to raise fresh funding at an $11 billion valuation are the latest sign that investors are betting the messy work of helping developers run AI models can become one of the next big businesses in AI. That boom has lifted a group of companies including Baseten,… 27

The Information — AI news-outlet 1mo ago Anthropic Releases New Flagship AI Model Anthropic on Thursday announced its new flagship AI model, Claude Opus 4.8, which showed improvements in standardized AI performance evaluations in coding, financial analysis and other fields. The company also said the model is more likely to flag uncertainties about its work… 22

The Information — AI news-outlet 1mo ago Anthropic Raises $65 Billion at $900 Billion Valuation; Micron, Samsung Invest Anthropic said Thursday it had raised $65 billion at a valuation of $900 billion before the financing, more than double the valuation in a round closed three months earlier.
New investors Micron, Samsung and SK Hynix, which make a key component of AI chips, are investing in the… 5

TechCrunch — AI news-outlet 1mo ago Anthropic raises $65 Billion, nears $1T valuation ahead of IPO Anthropic has closed a $65 billion Series H round at a $965 billion post-money valuation, marking what could be the AI startup's final private fundraise before a highly anticipated IPO. 14

Hacker News — AI on Front Page community 1mo ago Anthropic raises $65B in Series H funding at $965B post-money valuation Article URL: https://www.anthropic.com/news/series-h Comments URL: https://news.ycombinator.com/item?id=48313048 Points: 273 # Comments: 278 24

r/MachineLearning community 1mo ago Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D] Wall-OSS-0.5 is a new 4B VLA release from X Square Robot, built on a 3B VLM backbone with action experts in a Mixture-of-Transformers layout. What caught my eye is that the report evaluates the pretrained checkpoint on real robots before task-specific fine tuning, instead of… 25

r/LocalLLaMA community 1mo ago Qwen/Qwen-Image-Bench · Hugging Face Model Description Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality criteria organized in a 3-level hierarchy… 8

Latent.Space news-outlet 1mo ago [AINews] Cognition raises $1B in $26B Series D coding is an uncapped TAM market 13

Smol AI News news-outlet 1mo ago Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows **Anthropic** announced a massive **$65B Series H financing** at a **$965B valuation**, led by **Altimeter, Dragoneer, Greenoaks, and Sequoia**, with run-rate revenue surpassing **$47B**.
They launched **Claude Opus 4.8**, an update to Opus 4.7 featuring "sharper judgment,"… 28

arXiv — Machine Learning research 1mo ago A Simple State Space Model Excels at Multivariate Time Series Classification arXiv:2605.27406v1 Announce Type: new Abstract: Structured state space models (SSMs) have recently emerged as a promising foundation for sequence modeling, with Mamba-based architectures demonstrating strong performance through input-dependent state transitions, albeit at… 30

arXiv — Machine Learning research 1mo ago Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation arXiv:2605.27486v1 Announce Type: new Abstract: Federated learning (FL) has broadened the horizon for multivariate time series anomaly detection (MTSAD). However, benchmarking such anomaly detection methods within FL paradigm poses data-centric challenges. The existing datasets… 28

arXiv — Machine Learning research 1mo ago A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized… 17

arXiv — Machine Learning research 1mo ago Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection arXiv:2605.27992v1 Announce Type: new Abstract: Time series anomaly detection is critical for maintaining the reliability of mission-critical systems.
While Transformer-based models like PatchTST have shown remarkable performance, their $\mathcal{O}(L^2)$ computational… 11

arXiv — Machine Learning research 1mo ago Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector arXiv:2605.28103v1 Announce Type: new Abstract: We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors -- spanning statistical, reconstruction, association, frequency, and… 4

arXiv — Machine Learning research 1mo ago Refining Multidimensional Video Reward Models via Disentangled Influence Functions arXiv:2605.28203v1 Announce Type: new Abstract: As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional… 14

arXiv — NLP / Computation & Language research 1mo ago StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation arXiv:2605.27393v1 Announce Type: new Abstract: Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce… 7

arXiv — NLP / Computation & Language research 1mo ago Disentangling Language Roles in Multilingual LLM Task Execution arXiv:2605.27649v1 Announce Type: new Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide.
Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate… 28

arXiv — NLP / Computation & Language research 1mo ago ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation arXiv:2605.27709v1 Announce Type: new Abstract: Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning… 37

arXiv — NLP / Computation & Language research 1mo ago ChildEval: When large language models meet children's personalities arXiv:2605.27805v1 Announce Type: new Abstract: While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a… 20

arXiv — NLP / Computation & Language research 1mo ago GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors arXiv:2605.27866v1 Announce Type: new Abstract: Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for… 35

arXiv — NLP / Computation & Language research 1mo ago VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild arXiv:2605.27882v1 Announce Type: new Abstract: LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap.
We attribute this gap to existing benchmarks' reliance on over-specified… 38

arXiv — NLP / Computation & Language research 1mo ago Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking arXiv:2605.27914v1 Announce Type: new Abstract: Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge… 7

arXiv — NLP / Computation & Language research 1mo ago KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment… 10

arXiv — NLP / Computation & Language research 1mo ago Auditing Stance Asymmetry in Generative Explanations arXiv:2605.27988v1 Announce Type: new Abstract: Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations… 22

arXiv — NLP / Computation & Language research 1mo ago KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision.
Current MLLM safety evaluation tools, however, suffer from major limitations:… 18

arXiv — NLP / Computation & Language research 1mo ago The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates arXiv:2605.28020v1 Announce Type: new Abstract: With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token… 29

arXiv — NLP / Computation & Language research 1mo ago ATLAS: All-round Testing of Long-context Abilities across Scales arXiv:2605.28079v1 Announce Type: new Abstract: Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and… 4

arXiv — NLP / Computation & Language research 1mo ago Chinese Word Boundary Recovery through Character Alignment Projection arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper… 30

arXiv — NLP / Computation & Language research 1mo ago Why We Need Speech to Evaluate Speech Translation arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena.
We meta-evaluate both text- and… 35

arXiv — NLP / Computation & Language research 1mo ago Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach arXiv:2605.28313v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs… 38

r/MachineLearning community 1mo ago BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R] [R] BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison I’m looking for feedback on a local agent-memory benchmark comparison, especially from people who care about evaluation methodology. I built an open-source R&D memory system called Context Swarm Memory… 31

The Information — AI news-outlet 1mo ago Coding Startup Cognition Raises $1 Billion at a $26 Billion Valuation Coding startup Cognition has raised more than $1 billion in a funding round that valued the company at $26 billion including the investment, the company said in a blog post. That’s nearly double its valuation from its last fundraise, which valued the three-year-old company at… 11

Hugging Face Daily Papers research 1mo ago FastKernels: Benchmarking GPU Kernel Generation in Production Abstract FastKernels addresses the gap between benchmark evaluation and production performance for LLM kernel agents by providing a representative set of architectures and a production-grade inference framework that aligns evaluation with real-world deployment. AI-generated… 34

TechCrunch — AI news-outlet 1mo ago AI coding startup Cognition raises $1B at $25B pre-money valuation As Cognition reaches $492 million in annualized revenue run rate, it more than doubled its valuation in eight months, it says.
15

Hugging Face Daily Papers research 1mo ago QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents Abstract A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency. AI-generated summary Social… 30

Hugging Face Daily Papers research 1mo ago MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation Abstract A skill-centric agent framework enables continuous improvement of task-solving capabilities through a unified lifecycle of skill creation, memory, management, evaluation, and refinement. AI-generated summary Large language model (LLM) agents rely on reusable skills to… 21

Hugging Face Daily Papers research 1mo ago Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents Abstract Agentic CLEAR is an automatic evaluation framework that provides multi-level textual insights into agent behavior through dynamic analysis of LLM interactions across various benchmarks and settings. AI-generated summary Agentic systems are becoming more capable: agents… 19

The Information — AI news-outlet 1mo ago Inference Provider Baseten in Talks to Double Valuation to $11 Billion Baseten, a startup that rents out Nvidia AI servers to application developers and helps them customize models, has recently been in talks with investors to raise $1 billion at an $11 billion valuation including the money, The Information reported Tuesday.
That would more than… 37

arXiv — Machine Learning research 1mo ago TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models arXiv:2605.26161v1 Announce Type: new Abstract: Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing… 36

arXiv — Machine Learning research 1mo ago Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series arXiv:2605.26191v1 Announce Type: new Abstract: This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or… 22

arXiv — Machine Learning research 1mo ago Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection arXiv:2605.26193v1 Announce Type: new Abstract: Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications.
Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure… 38

arXiv — Machine Learning research 1mo ago On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series arXiv:2605.26194v1 Announce Type: new Abstract: Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting).… 30

arXiv — Machine Learning research 1mo ago Function-Valued Causal Influence in Nonlinear Time Series arXiv:2605.26408v1 Announce Type: new Abstract: Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the… 34

arXiv — Machine Learning research 1mo ago Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series arXiv:2605.26569v1 Announce Type: new Abstract: We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to… 14

arXiv — Machine Learning research 1mo ago Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets arXiv:2605.26690v1 Announce Type: new Abstract: Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative.
Existing reinforcement learning and off-policy generative approaches often… 9

arXiv — Machine Learning research 1mo ago SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation arXiv:2605.26704v1 Announce Type: new Abstract: Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models… 22

arXiv — Machine Learning research 1mo ago Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining arXiv:2605.26759v1 Announce Type: new Abstract: Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer… 33

arXiv — Machine Learning research 1mo ago Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability arXiv:2605.26790v1 Announce Type: new Abstract: Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions. In this work, we show these quantities can be accurately… 23