Tag: Funding. 500 articles archived under #funding. The Information — AI news-outlet 1mo ago AI Evaluators Struggle with Models That Know When They’re Being Tested AI researchers are starting to make progress on a confounding problem: AI models are getting better at telling when they are in an evaluation. That could become a problem for AI companies that use evaluations to gauge the capabilities and behaviors of their models before… 37 arXiv — Machine Learning research 1mo ago Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling arXiv:2605.30376v1 Announce Type: new Abstract: Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but… 15 arXiv — Machine Learning research 1mo ago A Novel Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection: MADQI arXiv:2605.30388v1 Announce Type: new Abstract: This paper introduces a new systematic framework for detecting anomalies in maritime Automatic Identification System (AIS) datasets. These anomalies include abnormal vessel behaviours related to speed, position jumps, time gaps,… 22 arXiv — Machine Learning research 1mo ago NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models arXiv:2605.30393v1 Announce Type: new Abstract: Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. 
We introduce NumLeak, a measurement framework that combines API-boundary… 25 arXiv — Machine Learning research 1mo ago MAAT: Multi-phase Adapter-Aware Targeted Unlearning arXiv:2605.30514v1 Announce Type: new Abstract: Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This… 10 arXiv — Machine Learning research 1mo ago Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents arXiv:2605.30590v1 Announce Type: new Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other… 23 arXiv — Machine Learning research 1mo ago Conformal Reliability: A New Evaluation Metric for Conditional Generation arXiv:2605.30807v1 Announce Type: new Abstract: Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is… 8 arXiv — Machine Learning research 1mo ago GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring arXiv:2605.30865v1 Announce Type: new Abstract: Continuous glucose monitoring (CGM) provides a dense view of daily metabolic physiology, yet existing generic time-series and CGM-specific foundation models often encode glucose traces as entangled single-stream sequences, leaving… 31 arXiv — NLP / Computation & Language research 1mo ago Refining Word-Based Grammatical Error Annotation for L2 Korean arXiv:2605.30545v1 Announce Type: new Abstract: Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. 
Postpositions and verbal endings are bound to lexical hosts, but they… 10 arXiv — NLP / Computation & Language research 1mo ago Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge arXiv:2605.30568v1 Announce Type: new Abstract: LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained… 37 arXiv — NLP / Computation & Language research 1mo ago TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal… 26 arXiv — NLP / Computation & Language research 1mo ago Pairwise Reference Alignment as a Model-Level Ordinal Observable arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference… 18 arXiv — NLP / Computation & Language research 1mo ago A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation arXiv:2605.31351v1 Announce Type: new Abstract: AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. 
The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general… 30 arXiv — NLP / Computation & Language research 1mo ago LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues… 36 arXiv — NLP / Computation & Language research 1mo ago BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali arXiv:2605.31483v1 Announce Type: new Abstract: Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination… 20 Hugging Face Daily Papers research 1mo ago Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios Abstract Swanbench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations. AI-generated summary Recent advances in speech generation have enabled… 5 Hugging Face Daily Papers research 1mo ago OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents Abstract OpenSkillEval is an automatic evaluation framework that assesses skill-augmented agent systems and skills across diverse real-world applications, revealing that skill availability doesn't guarantee effective usage and that performance benefits depend heavily on model… 31 r/LocalLLaMA community 1mo ago PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark) Author here. 
The short version of why I built this: Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in real-world environments such as defensive… 6 r/MachineLearning community 1mo ago Bayesian Opt. GPs vs Linear models and Neural Networks for parameter optimizations [R] Hi, Relatively new to deep learning. I wanted some opinions on which of these approaches might be best for time series data and spectral analysis. I currently use a GP and it works pretty well, but I’m wondering what the computational tradeoffs and so forth might be. Any ideas?… 4 Hacker News — AI on Front Page community 1mo ago OpenRouter raises $113M Series B Article URL: https://openrouter.ai/announcements/series-b Comments URL: https://news.ycombinator.com/item?id=48338660 Points: 242 # Comments: 110 4 TechCrunch — AI news-outlet 1mo ago The groupthink boom: what 3 top VCs really think about the AI frenzy "If you're 22 years old in San Francisco and building something in AI, there may be a seed term sheet in your inbox — but if you're 19, oh my God, this means you're really good; you might already have a Series A [offer]," said one, half-kiddingly. 12 r/LocalLLaMA community 1mo ago Gryphe/Pantheon-Reasoning-27B · Hugging Face from Gryphe: An experiment in bringing reasoning capability to the Pantheon roleplay series in the form of an uncensored dense Qwen 3.6 27B. 
This specific model can be thought of as a successor to both the Pantheon series and the one-time Codex release since I used such a large… 15 Hacker News — AI on Front Page community 1mo ago Danish pension fund excludes SpaceX citing governance and valuation Article URL: https://www.reuters.com/legal/transactional/danish-pension-fund-excludes-spacex-citing-governance-valuation-2026-05-29/ Comments URL: https://news.ycombinator.com/item?id=48333820 Points: 207 # Comments: 146 23 Hugging Face Daily Papers research 1mo ago Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection Abstract A parameter-efficient vision-language model is developed for time-series anomaly detection using a novel benchmark with natural-language rationales, achieving superior performance and generalization across multiple datasets. AI-generated summary Recent advances in… 38 TechCrunch — AI news-outlet 1mo ago This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory South Korean chip startup Xcena is betting that AI's real bottleneck is not compute, but memory. 20 Hugging Face Daily Papers research 1mo ago PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers Abstract PRISM evaluates automated peer review systems across multiple dimensions using argument mining and retrieval-augmented verification, revealing that while LLMs match human performance in specific areas, no system consistently equals human reviewers across all evaluation… 19 arXiv — Machine Learning research 1mo ago Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models arXiv:2605.28866v1 Announce Type: new Abstract: Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. 
However, prior studies largely overlook the inherent continuity and ordinality of time series… 20 arXiv — Machine Learning research 1mo ago PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation arXiv:2605.28867v1 Announce Type: new Abstract: Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an… 10 arXiv — Machine Learning research 1mo ago LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers arXiv:2605.29005v1 Announce Type: new Abstract: Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational… 18 arXiv — Machine Learning research 1mo ago Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation arXiv:2605.29108v1 Announce Type: new Abstract: Selecting efficient multi-step synthetic routes is a central challenge in organic synthesis, particularly in medicinal and process chemistry, where route choice directly impacts feasibility, cost, and development efficiency.… 28 arXiv — Machine Learning research 1mo ago RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains arXiv:2605.29156v1 Announce Type: new Abstract: Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit… 9 arXiv — Machine Learning research 1mo ago Do Physics Foundation Models Learn Generalizable Physics? 
A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts arXiv:2605.29283v1 Announce Type: new Abstract: Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to… 22 arXiv — Machine Learning research 1mo ago Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems arXiv:2605.29373v1 Announce Type: new Abstract: Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information. To address these issues, we… 13 arXiv — Machine Learning research 1mo ago Quotient DAGs for Off-Policy Evaluation:Forward-Flow Importance Sampling and Exact Slate Propensities arXiv:2605.29500v1 Announce Type: new Abstract: Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard… 11 arXiv — NLP / Computation & Language research 1mo ago Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated… 19 arXiv — NLP / Computation & Language research 1mo ago GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. 
Static bias benchmarks remain useful, but they do not show how… 31 arXiv — NLP / Computation & Language research 1mo ago GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human arXiv:2605.28882v1 Announce Type: new Abstract: With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet… 4 arXiv — NLP / Computation & Language research 1mo ago DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents arXiv:2605.29256v1 Announce Type: new Abstract: Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and… 21 arXiv — NLP / Computation & Language research 1mo ago A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods… 19 arXiv — NLP / Computation & Language research 1mo ago Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework arXiv:2605.29397v1 Announce Type: new Abstract: HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. 
The main obstacle is… 35 arXiv — NLP / Computation & Language research 1mo ago Comparative Evaluation of Machine Translation Systems on Images with Text arXiv:2605.29476v1 Announce Type: new Abstract: This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study… 7 arXiv — NLP / Computation & Language research 1mo ago PhoneWorld: Scaling Phone-Use Agent Environments arXiv:2605.29486v1 Announce Type: new Abstract: A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but… 28 arXiv — NLP / Computation & Language research 1mo ago From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a… 31 arXiv — NLP / Computation & Language research 1mo ago World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models arXiv:2605.29585v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. 
This hides whether the model perceived the right objects, represented the… 27 arXiv — NLP / Computation & Language research 1mo ago Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries,… 9 arXiv — NLP / Computation & Language research 1mo ago Personalized Turn-Level User Conversation Satisfaction Benchmark arXiv:2605.29711v1 Announce Type: new Abstract: User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation… 9 arXiv — NLP / Computation & Language research 1mo ago Metric-Dependent Annotation Saturation for Learning from Label Distributions arXiv:2605.29797v1 Announce Type: new Abstract: When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from… 37 arXiv — NLP / Computation & Language research 1mo ago Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels arXiv:2605.29800v1 Announce Type: new Abstract: LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how… 35 r/LocalLLaMA community 1mo ago llama.cpp B9387 Significant AMD/ROCm PP Update https://github.com/ggml-org/llama.cpp/releases/tag/b9387 MFMA is restricted to AMD CDNA architecture that's MI100, MI200, MI300 series datacenter cards. 
Post your initial results if you try it! 38 The Information — AI news-outlet 1mo ago Base Power in Talks to Raise Funds at $12 Billion Valuation Base Power, a three-year-old home-battery startup, is in talks to raise funds at a $12 billion valuation, according to a person with knowledge of the discussions. Ribbit Capital, which backed Base Power’s last funding round, has been in talks to lead the current round, according… 17