Tag Funding 500 articles archived under #funding arXiv — NLP / Computation & Language research 13d ago The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for… 34 arXiv — NLP / Computation & Language research 13d ago Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies arXiv:2606.18649v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female… 38 Hugging Face Daily Papers research 13d ago Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27 Hugging Face Daily Papers research 13d ago Re-Centering Humans in LLM Personalization Abstract Human-centered evaluation reveals significant gaps between synthetic and real-world LLM personalization performance, with models struggling to extract user attributes and generate truly personalized responses that match human quality judgments. Generated by… 30 TechCrunch — AI news-outlet 13d ago General Intuition in talks to raise $300M at around $2B valuation General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. 
The startup trains AI agents on spatial-temporal reasoning. 14 Hugging Face Daily Papers research 14d ago A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets Abstract A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Predictive code… 17 OpenAI official-blog 14d ago Improving health intelligence in ChatGPT Learn how GPT-5.5 Instant improves ChatGPT’s health and wellness responses with stronger reasoning, better context, clearer communication, and physician-informed evaluations. 7 arXiv — Machine Learning research 14d ago Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on… 19 arXiv — Machine Learning research 14d ago RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing arXiv:2606.18774v1 Announce Type: new Abstract: We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on… 35 arXiv — Machine Learning research 14d ago Anomaly Detection for Sparse and Irregular Multivariate Time Series with Latent SDEs arXiv:2606.18898v1 Announce Type: new Abstract: Multivariate time series anomaly detection (MTSAD) is critical for a wide range of application areas, such as industrial monitoring, cybersecurity, or healthcare. 
Real-world data is often sparse, irregularly sampled or partially… 8 arXiv — NLP / Computation & Language research 14d ago Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance arXiv:2606.18613v1 Announce Type: new Abstract: The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication.… 7 arXiv — NLP / Computation & Language research 14d ago Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports arXiv:2606.18797v1 Announce Type: new Abstract: Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this… 23 arXiv — NLP / Computation & Language research 14d ago Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series… 10 arXiv — NLP / Computation & Language research 14d ago Learning User Simulators with Turing Rewards arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. 
Existing approaches generally do so by… 37 arXiv — NLP / Computation & Language research 14d ago Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment arXiv:2606.18979v1 Announce Type: cross Abstract: Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but… 36 arXiv — NLP / Computation & Language research 14d ago Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation arXiv:2606.19139v1 Announce Type: cross Abstract: Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts,… 15 arXiv — NLP / Computation & Language research 14d ago ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution… 38 Hugging Face Daily Papers research 14d ago Physics-IQ Verified Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video… 29 r/LocalLLaMA community 14d ago Lin Junyang AI Lab Closes Round at $2B Valuation A new lab from Lin Junyang can only be good news for open source / weights, I think. Excited to see what the lead responsible for the Qwen line does next.   
submitted by /u/rmhubbert 38 TechCrunch — AI news-outlet 14d ago World model maker Odyssey nabs $1.45B valuation backed by Amazon and other big names World models are the next big thing in AI beyond LLMs and, with this round, Odyssey has cemented itself as one of the startups to watch. 30 TechCrunch — AI news-outlet 14d ago Pramaana Labs raises $27M seed round from Khosla Ventures to bring formal verification to AI Pramaana will focus on highly sensitive verticals like law, drug discovery, and tax preparation — where errors can be costly and reliability is at a premium. 22 arXiv — Machine Learning research 15d ago Informative Missingness to Generate Irregular Clinical Time Series arXiv:2606.17106v1 Announce Type: new Abstract: Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology,… 8 arXiv — Machine Learning research 15d ago Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis arXiv:2606.17115v1 Announce Type: new Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based… 18 arXiv — NLP / Computation & Language research 15d ago Rift: A Conflict Signature for Deception in Language Models arXiv:2606.17229v1 Announce Type: cross Abstract: A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. 
Our key move is a… 9 arXiv — Machine Learning research 15d ago Offline Preference-Based Trajectory Evaluation arXiv:2606.17541v1 Announce Type: new Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective… 20 arXiv — Machine Learning research 15d ago Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting arXiv:2606.17996v1 Announce Type: new Abstract: Cyclicity and trend are important components of time series data and many studies based on cyclicity and trend have achieved good results in long-term time series forecasting. However, we believe that current work neglects the… 37 arXiv — Machine Learning research 15d ago Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines arXiv:2606.18122v1 Announce Type: new Abstract: Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents… 36 arXiv — Machine Learning research 15d ago RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports arXiv:2606.17062v1 Announce Type: cross Abstract: Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. 
We propose RadSEM (Radiology Sentence-Level Evaluation… 11 arXiv — NLP / Computation & Language research 15d ago MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation arXiv:2606.17449v1 Announce Type: new Abstract: While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation… 38 arXiv — NLP / Computation & Language research 15d ago AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows arXiv:2606.17474v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential,… 17 arXiv — NLP / Computation & Language research 15d ago Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement arXiv:2606.17506v1 Announce Type: new Abstract: Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate… 4 arXiv — NLP / Computation & Language research 15d ago Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). 
We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction.… 19 arXiv — NLP / Computation & Language research 15d ago The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the… 7 arXiv — NLP / Computation & Language research 15d ago Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs arXiv:2606.17634v1 Announce Type: new Abstract: Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has… 30 arXiv — NLP / Computation & Language research 15d ago Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation arXiv:2606.17820v1 Announce Type: new Abstract: This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of… 28 arXiv — NLP / Computation & Language research 15d ago When Multiple Scripts Matter: Evaluating ASR in Clinical Settings arXiv:2606.17826v1 Announce Type: new Abstract: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. 
Conventional string-matching evaluation metrics… 20 arXiv — NLP / Computation & Language research 15d ago HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice arXiv:2606.18103v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual… 8 arXiv — NLP / Computation & Language research 15d ago RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills arXiv:2606.18203v1 Announce Type: new Abstract: The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an… 28 arXiv — NLP / Computation & Language research 15d ago Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization arXiv:2606.17092v1 Announce Type: cross Abstract: Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a… 8 arXiv — NLP / Computation & Language research 15d ago Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation arXiv:2606.17188v1 Announce Type: cross Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. 
We introduce PuMVR (Punjabi Multimodal… 15 arXiv — NLP / Computation & Language research 15d ago EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38 arXiv — NLP / Computation & Language research 15d ago FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback arXiv:2601.04574v2 Announce Type: replace Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate… 7 Hugging Face Daily Papers research 15d ago GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive… 31 TechCrunch — AI news-outlet 15d ago SpaceX valuation balloons to $2.6T, briefly passes Amazon SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday. 13 TechCrunch — AI news-outlet 15d ago SpaceX passes Amazon as valuation balloons to $2.7T SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday. 31 Hugging Face Daily Papers research 15d ago Artificial Intelligence Index Report 2026 Abstract Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. 
Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's… 32 The Information — AI news-outlet 16d ago SpaceX finalizes $60 billion deal to acquire Cursor SpaceX announced it agreed to buy AI coding startup Cursor for $60 billion on Tuesday. The announcement came only a few days after SpaceX went public at a valuation of about $1.77 trillion. Since the IPO, SpaceX stock has risen 42% to close on Monday at $193.50, valuing it at… 37 Hugging Face Daily Papers research 16d ago Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long… 28 arXiv — Machine Learning research 16d ago Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,… 23 arXiv — Machine Learning research 16d ago Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning arXiv:2606.15115v1 Announce Type: new Abstract: Multi-objective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively… 7