Tag

Funding

500 articles archived under #funding

arXiv — NLP / Computation & Language research 21d ago

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

arXiv:2606.11199v1 Announce Type: new Abstract: We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather…

24
arXiv — NLP / Computation & Language research 21d ago

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting…

29
arXiv — NLP / Computation & Language research 21d ago

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,…

16
arXiv — NLP / Computation & Language research 21d ago

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in…

20
arXiv — NLP / Computation & Language research 21d ago

AI Coding Agents Can Reproduce Social Science Findings

arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks…

8
arXiv — NLP / Computation & Language research 21d ago

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

arXiv:2606.11686v1 Announce Type: new Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed…

14
arXiv — NLP / Computation & Language research 21d ago

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

arXiv:2606.11762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable…

14
arXiv — NLP / Computation & Language research 21d ago

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

arXiv:2606.12117v1 Announce Type: new Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the…

27
arXiv — NLP / Computation & Language research 21d ago

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

arXiv:2606.12191v1 Announce Type: new Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work…

18
Hugging Face Daily Papers research 21d ago

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard…

6
Hugging Face Daily Papers research 21d ago

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Abstract Large language model agents require specialized environments for training and evaluation, which can be categorized by their engineering lifecycle stages and evolved through various paradigms including neural and symbolic approaches. Generated by…

8
Hugging Face Daily Papers research 21d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by…

38
TechCrunch — AI news-outlet 22d ago

Datadog veterans launch AI coding startup Niteshift on a bet against Big AI lock-in

AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. It's betting companies will want power over, not lock-in with model makers.

31
Hugging Face Daily Papers research 22d ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Abstract CapCode framework uses randomized testing with performance caps to detect and prevent shortcut exploitation in agent evaluation, while CapReward rewards systems that adhere to intended task specifications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A growing failure…

21
Hugging Face Daily Papers research 22d ago

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Abstract Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent years have witnessed the rapid…

15
arXiv — Machine Learning research 22d ago

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

arXiv:2606.09861v1 Announce Type: new Abstract: While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete…

5
arXiv — Machine Learning research 22d ago

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,…

23
arXiv — Machine Learning research 22d ago

Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection

arXiv:2606.09874v1 Announce Type: new Abstract: Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often…

22
arXiv — Machine Learning research 22d ago

FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses

arXiv:2606.09878v1 Announce Type: new Abstract: Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model…

20
arXiv — Machine Learning research 22d ago

SPDM: Geometry-Modulated State Space Modeling with Manifold Constraints for Time Series Forecasting

arXiv:2606.09917v1 Announce Type: new Abstract: Multivariate time series forecasting requires capturing the continuously evolving correlation structure among interacting variables. Existing state-space models process time series by scanning tokenized temporal or spatial…

30
arXiv — Machine Learning research 22d ago

Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization

arXiv:2606.10068v1 Announce Type: new Abstract: Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many…

35
arXiv — Machine Learning research 22d ago

Structured Adaptive Tensor Prediction for Streaming Data

arXiv:2606.10085v1 Announce Type: new Abstract: Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics. Existing methods are mainly designed for static settings and lack adaptability to streaming and…

33
arXiv — Machine Learning research 22d ago

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We…

20
arXiv — Machine Learning research 22d ago

Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series

arXiv:2606.10219v1 Announce Type: new Abstract: AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems:…

35
arXiv — NLP / Computation & Language research 22d ago

Automated Scoring of Arabic Text Using Large Language Models: A Literature Review

arXiv:2606.09830v1 Announce Type: new Abstract: In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and…

25
arXiv — NLP / Computation & Language research 22d ago

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to…

14
arXiv — NLP / Computation & Language research 22d ago

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

arXiv:2606.09852v1 Announce Type: cross Abstract: High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates…

36
arXiv — NLP / Computation & Language research 22d ago

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

arXiv:2606.10156v1 Announce Type: cross Abstract: As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce…

11
Hugging Face Daily Papers research 22d ago

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated…

12
r/MachineLearning community 22d ago

Phinite — multi-agent OS with first-class agent identity, composable skills, behavioral evaluation [P]

We spent the last year building what we think is the missing infrastructure layer for multi-agent systems. Open to everyone starting today. The technical problem: Agents have no identity. In microservices you have a service mesh + IAM. In agent systems you have a Python file. We…

12
Hugging Face Daily Papers research 22d ago

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Abstract Reference-free faithfulness metrics suffer from a blind spot measuring only precision, leading to rewards for abstention; completeness in deterministic domains enables measurement of both precision and recall, revealing that high-precision models often have poor fact…

34
TechCrunch — AI news-outlet 23d ago

Sandstone raises $30M to bring AI to in-house legal teams

Sandstone's Series A was led by Lightspeed Partners, with participation from Sequoia.

22
TechCrunch — AI news-outlet 23d ago

How an e-scooter founder raised $5 million to build space data centers

Orbital founder Euwyn Poon built 250,000 scooters at Spin. Now he wants to launch 10,000 space data centers.

27
Hugging Face Daily Papers research 23d ago

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Abstract Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences. Generated by…

37
Hugging Face Daily Papers research 23d ago

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Abstract Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models…

19
arXiv — Machine Learning research 23d ago

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

arXiv:2606.07605v1 Announce Type: new Abstract: Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be…

15
arXiv — Machine Learning research 23d ago

Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods

arXiv:2606.07607v1 Announce Type: new Abstract: Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable…

9
arXiv — Machine Learning research 23d ago

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

arXiv:2606.07616v1 Announce Type: new Abstract: Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference…

5
arXiv — Machine Learning research 23d ago

Learning Transfers: Kan Extensions for Neural Invariants

arXiv:2606.07627v1 Announce Type: new Abstract: Transfer learning presumes that a representation learned on source tasks carries structure that remains usable on related target tasks. Standard evaluations probe this through target accuracy or distributional discrepancy, yet…

8
arXiv — Machine Learning research 23d ago

Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment

arXiv:2606.07632v1 Announce Type: new Abstract: Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale.…

36
arXiv — Machine Learning research 23d ago

Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction

arXiv:2606.07698v1 Announce Type: new Abstract: Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by…

23
arXiv — Machine Learning research 23d ago

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no…

13
arXiv — Machine Learning research 23d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under…

33
Hugging Face Daily Papers research 23d ago

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key…

20
TechCrunch — AI news-outlet 23d ago

Mercor’s Brendan Foody calls out Sequoia over ‘dual-pricing’ valuation tricks

Sequoia is just one of the top firms that sell the same equity at two different prices.

28
The Information — AI news-outlet 23d ago

Databricks in Talks to Raise at Above $165 Billion Valuation

Databricks, a provider of database management software, has discussed raising more money in a funding round that could kick off within the next month, according to multiple people with direct knowledge of the conversations. Databricks has indicated to investors the new round…

13
Hugging Face Daily Papers research 24d ago

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

Abstract A novel attack-agnostic robustness metric based on Fisher Information Matrix spectral norm is proposed, providing theoretical bounds and scalable evaluation methods for deep neural network robustness assessment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The…

12
Hugging Face Daily Papers research 24d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
Hugging Face Daily Papers research 24d ago

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Abstract Small adaptation interfaces extend a frozen Music Transformer model to multiple genres, showing consistent improvement in harmonic prediction but limited genre identity representation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Harmony is a compact symbolic layer…

6
r/MachineLearning community 24d ago

Open image generation models are closer to closed-source quality than this sub thinks [D]

I run evaluations on generative image models as part of my workflow, mostly comparing coherence, prompt adherence, and compositional accuracy across different architectures. The consensus here seems to be that open models are still a generation behind closed APIs. Based on my…

25
