Tag: Funding · 500 articles archived under #funding arXiv — NLP / Computation & Language research 2d ago Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios? arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is… 17 TechCrunch — AI news-outlet 2d ago Omen AI’s plan to optimize data centers is all wet Omen AI raised a $31 million Series A to monitor chip coolant and stop bacterial outbreaks in data centers. 8 arXiv — Machine Learning research 3d ago Unified Zero-Shot Time Series Forecasting: A Darts Foundation arXiv:2606.27438v1 Announce Type: new Abstract: Since its initial release in 2020, Darts has become a widely used open-source Python library for time series analysis. A series of foundation models have recently claimed accuracy improvements in zero-shot forecasting, promising a… 15 arXiv — Machine Learning research 3d ago Productionized Fairness Measurement Under Privacy Constraints arXiv:2606.27558v1 Announce Type: new Abstract: Fairness measurements in the form of disaggregated evaluations often rely on demographic signals that are legally constrained or culturally sensitive. Race and ethnicity signals are among the more difficult signals to curate and… 34 arXiv — Machine Learning research 3d ago Quantum Generative Diffusion Model for Real-World Time Series arXiv:2606.27561v1 Announce Type: new Abstract: Generative models have achieved remarkable success in data synthesis, though recent advances driven by increasing model scale have introduced challenges in computational cost and efficiency. 
Quantum machine learning offers a… 10 arXiv — Machine Learning research 3d ago GNBAN: Graph Neural Basis Attention Networks for Long-Horizon Forecasting over Large Entity Sets arXiv:2606.27863v1 Announce Type: new Abstract: Demand forecasting at the bottom of a retail hierarchy requires predicting tens of thousands of correlated long-horizon series across products, stores, and regions. Modern systems must scale across massive catalogs, capture shared… 33 arXiv — Machine Learning research 3d ago TA-SparseMG: Trend-Aware Sparse Forecasting via Multi-Scale Gating for Long-Term Time Series arXiv:2606.27908v1 Announce Type: new Abstract: Long-term time series forecasting finds extensive applications in domains such as power demand, traffic flow, meteorological observation, and renewable energy dispatch. Forecasting dynamically varying long-term time series poses… 21 arXiv — Machine Learning research 3d ago Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets… 21 arXiv — Machine Learning research 3d ago COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives arXiv:2606.28194v1 Announce Type: new Abstract: While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on… 18 arXiv — Machine Learning research 3d ago Democratic ICAI: Debating Our Way to Steering Principles from Preferences arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. 
Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the… 38 arXiv — NLP / Computation & Language research 3d ago Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs arXiv:2606.27378v1 Announce Type: new Abstract: We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks.… 29 arXiv — NLP / Computation & Language research 3d ago Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs arXiv:2606.27909v1 Announce Type: new Abstract: Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever… 15 arXiv — NLP / Computation & Language research 3d ago Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA arXiv:2606.28050v1 Announce Type: new Abstract: LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model… 29 arXiv — NLP / Computation & Language research 3d ago Subject-level Inference for Realistic Text Anonymization Evaluation arXiv:2604.21211v2 Announce Type: replace Abstract: Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations,… 6 r/LocalLLaMA community 3d ago DeepSpec - a deepseek-ai Collection DeepSpec DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding. 
It contains data preparation utilities, draft model implementations, training code, and evaluation scripts. Released Checkpoints The checkpoints below are the ones used… 26 r/LocalLLaMA community 4d ago I had 55 LLMs blind-grade each other (22k judgments, all open). Every model family with enough data is biased toward its own siblings. Qwen judges favor Qwen by ~0.9 points. Mistral penalizes its own by ~1.0. I have been running an open evaluation setup where N models answer the same prompt, then blind-grade each other in an N x N matrix with self-judgments excluded. No single privileged judge. So far: 286 evaluations, 198 hand-written questions, 22,254 valid judgments across 55… 35 Hugging Face Daily Papers research 4d ago COrigami: An AI Pipeline for Co-Designing Flat-Foldable Visually Recognisable Origami Abstract A computational origami system generates crease patterns from natural language using AI-driven optimization and aesthetic evaluation, enabling human-AI collaboration in mathematically constrained design. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While generative AI… 11 r/MachineLearning community 4d ago Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P] When evaluating migrating production LLM workloads off commercial cloud APIs, the conversation usually gets oversimplified into a trade-off between quality and infrastructure cost. To look past clean, isolated averages, I built a repeatable evaluation matrix using a real-world… 29 r/LocalLLaMA community 4d ago Orthrus (diffusion head) trained Qwen 3.5/3.6 and Gemma 4 models are dropping soon "Hi all, we are finalized with our testing and are preparing the release pipeline. We will be releasing support for the Qwen3.5, Qwen3.6, and Gemma4 very soon. Alongside the model checkpoints, we will be open-sourcing our complete end-to-end training and evaluation code. 
Stay… 19 arXiv — Machine Learning research 6d ago Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's… 4 arXiv — Machine Learning research 6d ago The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators arXiv:2606.26294v1 Announce Type: new Abstract: Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier,… 25 arXiv — Machine Learning research 6d ago EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning arXiv:2606.26327v1 Announce Type: new Abstract: In actor-critic reinforcement learning, network architectures are typically manually designed. Automating this design is challenging because each candidate must be trained before evaluation, and the design space is open-ended. To… 29 arXiv — NLP / Computation & Language research 6d ago DualEval: Joint Model-Item Calibration for Unified LLM Evaluation arXiv:2606.26429v1 Announce Type: cross Abstract: Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. 
We introduce… 24 arXiv — Machine Learning research 6d ago Empirical Software Engineering TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform arXiv:2606.26590v1 Announce Type: new Abstract: Security misconfigurations in Terraform Infrastructure-as-Code are a growing risk in cloud deployments, and large language models are increasingly used as automated repair agents. Existing evaluations often treat a repair as… 5 arXiv — Machine Learning research 6d ago Target-Aware Bandit Allocation for Scalable Surrogate Optimization in Chemical Space arXiv:2606.26657v1 Announce Type: new Abstract: Identifying high-utility candidates from massive discrete spaces under expensive evaluations is a recurring challenge across the sciences, with structure-based drug discovery as a prominent example. While surrogate-based… 20 arXiv — Machine Learning research 6d ago Decision-Aligned Evaluation of Uncertainty Quantification arXiv:2606.26990v1 Announce Type: new Abstract: Uncertainty estimates in machine learning are typically evaluated using generic metrics such as the negative log-likelihood and expected calibration error, yet good performance on such metrics does not necessarily imply high… 13 arXiv — NLP / Computation & Language research 6d ago Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models arXiv:2606.26101v1 Announce Type: new Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. 
We present a… 21 arXiv — NLP / Computation & Language research 6d ago From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's… 12 arXiv — NLP / Computation & Language research 6d ago ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent arXiv:2606.26403v1 Announce Type: new Abstract: Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or… 34 arXiv — NLP / Computation & Language research 6d ago Evaluation Pitfalls and Challenges in Multimedia Event Extraction arXiv:2606.26775v1 Announce Type: new Abstract: Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding. While recent work reports steady and… 15 arXiv — NLP / Computation & Language research 6d ago Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. 
While… 36 arXiv — NLP / Computation & Language research 6d ago Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents arXiv:2606.26479v1 Announce Type: cross Abstract: Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model… 38 arXiv — NLP / Computation & Language research 6d ago Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against… 18 arXiv — NLP / Computation & Language research 6d ago Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement arXiv:2606.27226v1 Announce Type: cross Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores… 14 Hugging Face Daily Papers research 6d ago GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents Abstract Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. 
We introduce a… 7 Hugging Face Daily Papers research 6d ago Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are… 23 TechCrunch — AI news-outlet 6d ago General Intuition’s $2.3B bet that video games can train AI agents for the real world General Intuition has raised $320 million to scale AI trained on millions of hours of gameplay, betting action data can help AI develop something closer to human intuition. 25 TechCrunch — AI news-outlet 6d ago Netris raises $15M Series A from a16z to help AI neoclouds go live faster Netris provides software that runs on network switches, and offers a platform that helps neocloud operators reduce the time it takes to go live. 36 Hugging Face Daily Papers research 7d ago CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression Abstract Two-channel evaluation shows output compression reduces costs while input compression increases costs and degrades accuracy across models and datasets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct "Talk short. Drop grammar. Save token." 
This caveman style is widely… 28 arXiv — Machine Learning research 7d ago Towards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series arXiv:2606.24955v1 Announce Type: new Abstract: Power forecasting models deployed in real-world energy markets must operate under nonstationary conditions, where data distributions continually evolve due to weather variability, infrastructure upgrades, and changing consumption… 24 arXiv — Machine Learning research 7d ago Adapt Only When It Pays: Budgeted Decision-Loss Priority for Delayed Online Time-Series Adaptation arXiv:2606.25068v1 Announce Type: new Abstract: Online time-series forecasters receive labels only after horizon-dependent delays, while every adaptation step spends limited compute. We study when an online learner should update, not how to adapt at every opportunity, and… 18 arXiv — Machine Learning research 7d ago An iterative energy-based multimodal transformer for joint retrieval of wheat soil moisture, leaf area index, and plant height from Sentinel-1 and Sentinel-2 time series arXiv:2606.25174v1 Announce Type: new Abstract: Field-scale retrieval of surface soil moisture (SM), leaf area index (LAI), and plant height (PH) is essential for precision agriculture, yet it remains an ill-posed inverse problem. Concurrent variations in soil moisture and… 24 arXiv — Machine Learning research 7d ago UC-Search: Risk-Aware Test-Time Search for Delayed Constrained Time-Series Control arXiv:2606.25274v1 Announce Type: new Abstract: Time-series models are usually scored as forecasters, yet deployed systems often require delayed decisions under uncertainty and hard feasibility constraints. 
UC-Search is a model-agnostic test-time wrapper: a backbone emits… 29 arXiv — Machine Learning research 7d ago TopoCast: A Topological Fidelity Framework for Evaluating Transformer-Based Time Series Forecasting arXiv:2606.25439v1 Announce Type: new Abstract: Deep learning-based models have achieved state-of-the-art performance in Time Series Forecasting (TSF), yet their evaluation remains dominated by pointwise error metrics such as Mean Squared Error (MSE), which quantify numerical… 37 arXiv — NLP / Computation & Language research 7d ago The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms arXiv:2606.25450v1 Announce Type: cross Abstract: Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a… 12 arXiv — Machine Learning research 7d ago Leaking Circuit Secrets: Gradient Leakage Attacks on Graph Neural Networks arXiv:2606.25589v1 Announce Type: new Abstract: As graph neural networks (GNNs) become standard tools for critical tasks in circuit design and analysis, their security and privacy risks require careful attention. Here, we present the first comprehensive evaluation of gradient… 20 arXiv — NLP / Computation & Language research 7d ago LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges arXiv:2606.25057v1 Announce Type: new Abstract: The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent… 11 arXiv — NLP / Computation & Language research 7d ago Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One arXiv:2606.25449v1 Announce Type: new Abstract: A language model's memory can be worse than having no memory at all. 
Give a model a memory that kept a wrong conclusion but dropped the work behind it, and it emits that stale value as a confident answer; give the same model an… 30 arXiv — NLP / Computation & Language research 7d ago A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and… 36 arXiv — NLP / Computation & Language research 7d ago Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization arXiv:2606.25656v1 Announce Type: new Abstract: As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for different RAG scenarios evaluation and comparison on semi-structured knowledge… 21