Tag

Funding

500 articles archived under #funding

arXiv — NLP / Computation & Language research 13d ago

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for…

34
arXiv — NLP / Computation & Language research 13d ago

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

arXiv:2606.18649v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female…

38
Hugging Face Daily Papers research 13d ago

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria.

27
Hugging Face Daily Papers research 13d ago

Re-Centering Humans in LLM Personalization

Abstract Human-centered evaluation reveals significant gaps between synthetic and real-world LLM personalization performance, with models struggling to extract user attributes and generate truly personalized responses that match human quality judgments.

30
TechCrunch — AI news-outlet 13d ago

General Intuition in talks to raise $300M at around $2B valuation

General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning.

14
Hugging Face Daily Papers research 14d ago

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

Abstract A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. Predictive code…

17
OpenAI official-blog 14d ago

Improving health intelligence in ChatGPT

Learn how GPT-5.5 Instant improves ChatGPT’s health and wellness responses with stronger reasoning, better context, clearer communication, and physician-informed evaluations.

7
arXiv — Machine Learning research 14d ago

Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting

arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on…

19
arXiv — Machine Learning research 14d ago

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

arXiv:2606.18774v1 Announce Type: new Abstract: We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on…

35
arXiv — Machine Learning research 14d ago

Anomaly Detection for Sparse and Irregular Multivariate Time Series with Latent SDEs

arXiv:2606.18898v1 Announce Type: new Abstract: Multivariate time series anomaly detection (MTSAD) is critical for a wide range of application areas, such as industrial monitoring, cybersecurity, or healthcare. Real-world data is often sparse, irregularly sampled or partially…

8
arXiv — NLP / Computation & Language research 14d ago

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

arXiv:2606.18613v1 Announce Type: new Abstract: The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication.…

7
arXiv — NLP / Computation & Language research 14d ago

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

arXiv:2606.18797v1 Announce Type: new Abstract: Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this…

23
arXiv — NLP / Computation & Language research 14d ago

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series…

10
arXiv — NLP / Computation & Language research 14d ago

Learning User Simulators with Turing Rewards

arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by…

37
arXiv — NLP / Computation & Language research 14d ago

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

arXiv:2606.18979v1 Announce Type: cross Abstract: Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but…

36
arXiv — NLP / Computation & Language research 14d ago

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

arXiv:2606.19139v1 Announce Type: cross Abstract: Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts,…

15
arXiv — NLP / Computation & Language research 14d ago

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution…

38
Hugging Face Daily Papers research 14d ago

Physics-IQ Verified

Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video…

29
r/LocalLLaMA community 14d ago

Lin Junyang AI Lab Closes Round at $2B Valuation

A new lab from Lin Junyang can only be good news for open source / weights, I think. Excited to see what the lead responsible for the Qwen line does next. Submitted by /u/rmhubbert

38
TechCrunch — AI news-outlet 14d ago

World model maker Odyssey nabs $1.45B valuation backed by Amazon and other big names

World models are the next big thing in AI beyond LLMs and, with this round, Odyssey has cemented itself as one of the startups to watch.

30
TechCrunch — AI news-outlet 14d ago

Pramaana Labs raises $27M seed round from Khosla Ventures to bring formal verification to AI

Pramaana will focus on highly sensitive verticals like law, drug discovery, and tax preparation — where errors can be costly and reliability is at a premium.

22
arXiv — Machine Learning research 15d ago

Informative Missingness to Generate Irregular Clinical Time Series

arXiv:2606.17106v1 Announce Type: new Abstract: Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology,…

8
arXiv — Machine Learning research 15d ago

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

arXiv:2606.17115v1 Announce Type: new Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based…

18
arXiv — NLP / Computation & Language research 15d ago

Rift: A Conflict Signature for Deception in Language Models

arXiv:2606.17229v1 Announce Type: cross Abstract: A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a…

9
arXiv — Machine Learning research 15d ago

Offline Preference-Based Trajectory Evaluation

arXiv:2606.17541v1 Announce Type: new Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective…

20
arXiv — Machine Learning research 15d ago

Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting

arXiv:2606.17996v1 Announce Type: new Abstract: Cyclicity and trend are important components of time series data and many studies based on cyclicity and trend have achieved good results in long-term time series forecasting. However, we believe that current work neglects the…

37
arXiv — Machine Learning research 15d ago

Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines

arXiv:2606.18122v1 Announce Type: new Abstract: Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents…

36
arXiv — Machine Learning research 15d ago

RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports

arXiv:2606.17062v1 Announce Type: cross Abstract: Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation…

11
arXiv — NLP / Computation & Language research 15d ago

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

arXiv:2606.17449v1 Announce Type: new Abstract: While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation…

38
arXiv — NLP / Computation & Language research 15d ago

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

arXiv:2606.17474v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential,…

17
arXiv — NLP / Computation & Language research 15d ago

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

arXiv:2606.17506v1 Announce Type: new Abstract: Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate…

4
arXiv — NLP / Computation & Language research 15d ago

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction.…

19
arXiv — NLP / Computation & Language research 15d ago

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the…

7
arXiv — NLP / Computation & Language research 15d ago

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

arXiv:2606.17634v1 Announce Type: new Abstract: Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has…

30
arXiv — NLP / Computation & Language research 15d ago

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

arXiv:2606.17820v1 Announce Type: new Abstract: This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of…

28
arXiv — NLP / Computation & Language research 15d ago

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

arXiv:2606.17826v1 Announce Type: new Abstract: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics…

20
arXiv — NLP / Computation & Language research 15d ago

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

arXiv:2606.18103v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual…

8
arXiv — NLP / Computation & Language research 15d ago

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

arXiv:2606.18203v1 Announce Type: new Abstract: The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an…

28
arXiv — NLP / Computation & Language research 15d ago

Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

arXiv:2606.17092v1 Announce Type: cross Abstract: Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a…

8
arXiv — NLP / Computation & Language research 15d ago

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

arXiv:2606.17188v1 Announce Type: cross Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal…

15
arXiv — NLP / Computation & Language research 15d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
arXiv — NLP / Computation & Language research 15d ago

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

arXiv:2601.04574v2 Announce Type: replace Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate…

7
Hugging Face Daily Papers research 15d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive…

31
TechCrunch — AI news-outlet 15d ago

SpaceX valuation balloons to $2.6T, briefly passes Amazon

SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday.

13
TechCrunch — AI news-outlet 15d ago

SpaceX passes Amazon as valuation balloons to $2.7T

SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday.

31
Hugging Face Daily Papers research 15d ago

Artificial Intelligence Index Report 2026

Abstract Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's…

32
The Information — AI news-outlet 16d ago

SpaceX finalizes $60 billion deal to acquire Cursor

SpaceX announced it agreed to buy AI coding startup Cursor for $60 billion on Tuesday. The announcement came only a few days after SpaceX went public at a valuation of about $1.77 trillion. Since the IPO, SpaceX stock has risen 42% to close on Monday at $193.50, valuing it at…

37
Hugging Face Daily Papers research 16d ago

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Web agents act through long…

28
arXiv — Machine Learning research 16d ago

Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,…

23
arXiv — Machine Learning research 16d ago

Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning

arXiv:2606.15115v1 Announce Type: new Abstract: Multi-objective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively…

7
