Funding: 500 articles archived under #funding
arXiv — NLP / Computation & Language research 21d ago NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track arXiv:2606.11199v1 Announce Type: new Abstract: We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather… 24 arXiv — NLP / Computation & Language research 21d ago BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting… 29 arXiv — NLP / Computation & Language research 21d ago Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,… 16 arXiv — NLP / Computation & Language research 21d ago Agent Skill Evaluation and Evolution: Frameworks and Benchmarks arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. 
As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20 arXiv — NLP / Computation & Language research 21d ago AI Coding Agents Can Reproduce Social Science Findings arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks… 8 arXiv — NLP / Computation & Language research 21d ago Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness arXiv:2606.11686v1 Announce Type: new Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed… 14 arXiv — NLP / Computation & Language research 21d ago Automated Creativity Evaluation of Language Models Across Open-Ended Tasks arXiv:2606.11762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable… 14 arXiv — NLP / Computation & Language research 21d ago Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation arXiv:2606.12117v1 Announce Type: new Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. 
This especially penalizes base models that may know the… 27 arXiv — NLP / Computation & Language research 21d ago Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application arXiv:2606.12191v1 Announce Type: new Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work… 18 Hugging Face Daily Papers research 21d ago TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard… 6 Hugging Face Daily Papers research 21d ago Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application Abstract Large language model agents require specialized environments for training and evaluation, which can be categorized by their engineering lifecycle stages and evolved through various paradigms including neural and symbolic approaches. Generated by… 8 Hugging Face Daily Papers research 21d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by… 38 TechCrunch — AI news-outlet 22d ago Datadog veterans launch AI coding startup Niteshift on a bet against Big AI lock-in AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. 
It's betting companies will want power over, not lock-in with model makers. 31 Hugging Face Daily Papers research 22d ago Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests Abstract CapCode framework uses randomized testing with performance caps to detect and prevent shortcut exploitation in agent evaluation, while CapReward rewards systems that adhere to intended task specifications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A growing failure… 21 Hugging Face Daily Papers research 22d ago Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields Abstract Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent years have witnessed the rapid… 15 arXiv — Machine Learning research 22d ago Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models arXiv:2606.09861v1 Announce Type: new Abstract: While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete… 5 arXiv — Machine Learning research 22d ago Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,… 23 arXiv — Machine Learning research 22d ago Disjoint or Overlapping? 
Inference Windowing for Reconstruction-Based Time Series Anomaly Detection arXiv:2606.09874v1 Announce Type: new Abstract: Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often… 22 arXiv — Machine Learning research 22d ago FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses arXiv:2606.09878v1 Announce Type: new Abstract: Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model… 20 arXiv — Machine Learning research 22d ago SPDM: Geometry-Modulated State Space Modeling with Manifold Constraints for Time Series Forecasting arXiv:2606.09917v1 Announce Type: new Abstract: Multivariate time series forecasting requires capturing the continuously evolving correlation structure among interacting variables. Existing state-space models process time series by scanning tokenized temporal or spatial… 30 arXiv — Machine Learning research 22d ago Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization arXiv:2606.10068v1 Announce Type: new Abstract: Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many… 35 arXiv — Machine Learning research 22d ago Structured Adaptive Tensor Prediction for Streaming Data arXiv:2606.10085v1 Announce Type: new Abstract: Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics. 
Existing methods are mainly designed for static settings and lack adaptability to streaming and… 33 arXiv — Machine Learning research 22d ago MMClima: A Framework for Multimodal Climate Science Data and Evaluation arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We… 20 arXiv — Machine Learning research 22d ago Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series arXiv:2606.10219v1 Announce Type: new Abstract: AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems:… 35 arXiv — NLP / Computation & Language research 22d ago Automated Scoring of Arabic Text Using Large Language Models: A Literature Review arXiv:2606.09830v1 Announce Type: new Abstract: In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and… 25 arXiv — NLP / Computation & Language research 22d ago VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development. 
Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to… 14 arXiv — NLP / Computation & Language research 22d ago LLM-Based Code Documentation Generation and Multi-Judge Evaluation arXiv:2606.09852v1 Announce Type: cross Abstract: High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates… 36 arXiv — NLP / Computation & Language research 22d ago $\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems arXiv:2606.10156v1 Announce Type: cross Abstract: As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce… 11 Hugging Face Daily Papers research 22d ago When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated… 12 r/MachineLearning community 22d ago Phinite — multi-agent OS with first-class agent identity, composable skills, behavioral evaluation [P] We spent the last year building what we think is the missing infrastructure layer for multi-agent systems. Open to everyone starting today. The technical problem: Agents have no identity. In microservices you have a service mesh + IAM. In agent systems you have a Python file. 
We… 12 Hugging Face Daily Papers research 22d ago Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle Abstract Reference-free faithfulness metrics suffer from a blind spot measuring only precision, leading to rewards for abstention; completeness in deterministic domains enables measurement of both precision and recall, revealing that high-precision models often have poor fact… 34 TechCrunch — AI news-outlet 23d ago Sandstone raises $30M to bring AI to in-house legal teams Sandstone's Series A was led by Lightspeed Partners, with participation from Sequoia. 22 TechCrunch — AI news-outlet 23d ago How an e-scooter founder raised $5 million to build space data centers Orbital founder Euwyn Poon built 250,000 scooters at Spin. Now he wants to launch 10,000 space data centers. 27 Hugging Face Daily Papers research 23d ago Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data Abstract Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences. Generated by… 37 Hugging Face Daily Papers research 23d ago Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill Abstract Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models… 19 arXiv — Machine Learning research 23d ago SRT: Super-Resolution for Time Series via Disentangled Rectified Flow arXiv:2606.07605v1 Announce Type: new Abstract: Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. 
However, the acquisition of such data is often limited by cost and feasibility. This problem can be… 15 arXiv — Machine Learning research 23d ago Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods arXiv:2606.07607v1 Announce Type: new Abstract: Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable… 9 arXiv — Machine Learning research 23d ago Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation arXiv:2606.07616v1 Announce Type: new Abstract: Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference… 5 arXiv — Machine Learning research 23d ago Learning Transfers: Kan Extensions for Neural Invariants arXiv:2606.07627v1 Announce Type: new Abstract: Transfer learning presumes that a representation learned on source tasks carries structure that remains usable on related target tasks. 
Standard evaluations probe this through target accuracy or distributional discrepancy, yet… 8 arXiv — Machine Learning research 23d ago Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment arXiv:2606.07632v1 Announce Type: new Abstract: Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale.… 36 arXiv — Machine Learning research 23d ago Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction arXiv:2606.07698v1 Announce Type: new Abstract: Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by… 23 arXiv — Machine Learning research 23d ago Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. 
For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no… 13 arXiv — Machine Learning research 23d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under… 33 Hugging Face Daily Papers research 23d ago Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key… 20 TechCrunch — AI news-outlet 23d ago Mercor’s Brendan Foody calls out Sequoia over ‘dual-pricing’ valuation tricks Sequoia is just one of the top firms that sell the same equity at two different prices. 28 The Information — AI news-outlet 23d ago Databricks in Talks to Raise at Above $165 Billion Valuation Databricks, a provider of database management software, has discussed raising more money in a funding round that could kick off within the next month, according to multiple people with direct knowledge of the conversations. Databricks has indicated to investors the new round… 13 Hugging Face Daily Papers research 24d ago Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms Abstract A novel attack-agnostic robustness metric based on Fisher Information Matrix spectral norm is proposed, providing theoretical bounds and scalable evaluation methods for deep neural network robustness assessment. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct The… 12 Hugging Face Daily Papers research 24d ago Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by… 35 Hugging Face Daily Papers research 24d ago How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling Abstract Small adaptation interfaces extend a frozen Music Transformer model to multiple genres, showing consistent improvement in harmonic prediction but limited genre identity representation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Harmony is a compact symbolic layer… 6 r/MachineLearning community 24d ago Open image generation models are closer to closed-source quality than this sub thinks [D] I run evaluations on generative image models as part of my workflow, mostly comparing coherence, prompt adherence, and compositional accuracy across different architectures. The consensus here seems to be that open models are still a generation behind closed APIs. Based on my… 25