Research papers

500 articles archived under #paper

arXiv — NLP / Computation & Language research 1d ago

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

arXiv:2606.31002v1 Announce Type: cross Abstract: Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean…

35
arXiv — NLP / Computation & Language research 1d ago

ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive…

37
arXiv — NLP / Computation & Language research 1d ago

Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021

arXiv:2606.31081v1 Announce Type: cross Abstract: The present study analyzed over 26,000 research articles published between 1991 and 2021 in twenty-one major LIS (Library and Information Science) journals, using the machine learning (ML) approach to categorize the research…

5
arXiv — NLP / Computation & Language research 1d ago

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

arXiv:2606.31128v1 Announce Type: cross Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion…

30
arXiv — NLP / Computation & Language research 1d ago

PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding

arXiv:2606.31148v1 Announce Type: cross Abstract: 3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions. Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high…

17
arXiv — NLP / Computation & Language research 1d ago

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite…

29
arXiv — NLP / Computation & Language research 1d ago

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

arXiv:2606.31270v1 Announce Type: cross Abstract: Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these…

20
arXiv — NLP / Computation & Language research 1d ago

The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills

arXiv:2606.31272v1 Announce Type: cross Abstract: AI agents increasingly acquire and execute skills at runtime: bundles of prompt instructions, executable code, and tool declarations fetched from marketplaces and other agents. Governing them needs a stable notion of skill…

16
arXiv — NLP / Computation & Language research 1d ago

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

arXiv:2606.31407v1 Announce Type: cross Abstract: Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows…

15
arXiv — NLP / Computation & Language research 1d ago

CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

arXiv:2606.31435v1 Announce Type: cross Abstract: Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or…

38
arXiv — NLP / Computation & Language research 1d ago

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

arXiv:2606.31511v1 Announce Type: cross Abstract: In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a…

9
arXiv — NLP / Computation & Language research 1d ago

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

arXiv:2606.31543v1 Announce Type: cross Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I…

4
arXiv — NLP / Computation & Language research 1d ago

ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

arXiv:2606.31693v1 Announce Type: cross Abstract: The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation…

38
arXiv — NLP / Computation & Language research 1d ago

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

arXiv:2606.31694v1 Announce Type: cross Abstract: For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from…

18
arXiv — NLP / Computation & Language research 1d ago

SpikeLogBERT: Energy-Efficient Log Parsing Using Spiking Transformer Networks

arXiv:2606.31781v1 Announce Type: cross Abstract: Log parsing is a fundamental step in automated log analysis, transforming raw system logs into structured event templates for downstream tasks such as anomaly detection and system monitoring. Existing log parsing methods range…

17
arXiv — NLP / Computation & Language research 1d ago

MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

arXiv:2606.31966v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a…

4
arXiv — NLP / Computation & Language research 1d ago

Learning by Surprise: Adaptive Mitigation of Model Collapse in Large Language Models

arXiv:2410.12341v4 Announce Type: replace Abstract: As AI-generated content increasingly populates the web, generative AI models are at growing risk of being trained on their own outputs, a process known as AI autophagy. This feedback loop has been shown to induce model…

16
arXiv — NLP / Computation & Language research 1d ago

Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection

arXiv:2502.15845v2 Announce Type: replace Abstract: Large Language Models (LLMs) often hallucinate, limiting their reliability in sensitive applications. In black-box settings, several self-consistency-based techniques have been proposed for hallucination detection. We…

29
arXiv — NLP / Computation & Language research 1d ago

SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA

arXiv:2504.07385v3 Announce Type: replace Abstract: As Large Language Models (LLMs) become increasingly used for question-answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. Meanwhile,…

26
arXiv — NLP / Computation & Language research 1d ago

From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

arXiv:2506.17294v3 Announce Type: replace Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding research area, offering advantages such as scalable availability and personalized narration. However, existing…

17
arXiv — NLP / Computation & Language research 1d ago

The Bidirectional Process Reward Model

arXiv:2508.01682v3 Announce Type: replace Abstract: Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs).…

5
arXiv — NLP / Computation & Language research 1d ago

Rethinking On-policy Optimization for Query Augmentation

arXiv:2510.17139v3 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or…

28
arXiv — NLP / Computation & Language research 1d ago

Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

arXiv:2512.21002v3 Announce Type: replace Abstract: Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt…

28
arXiv — NLP / Computation & Language research 1d ago

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

arXiv:2601.04126v3 Announce Type: replace Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present…

29
arXiv — NLP / Computation & Language research 1d ago

What If We Allocate Test-Time Compute Adaptively?

arXiv:2602.01070v5 Announce Type: replace Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning…

30
arXiv — NLP / Computation & Language research 1d ago

FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

arXiv:2602.06625v2 Announce Type: replace Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and…

7
arXiv — NLP / Computation & Language research 1d ago

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

arXiv:2603.19453v3 Announce Type: replace Abstract: We propose an LLM harness that generates code-based policy functions for multi-agent environments, evaluates them with self-play, and refines them using feedback from previous iterations. Following the recent line of work in…

28
Hacker News — AI on Front Page community 1d ago

ArXiv's Next Chapter

Article URL: https://blog.arxiv.org/2026/06/30/arxivs-next-chapter/ · Comments URL: https://news.ycombinator.com/item?id=48741748 · Points: 200 · Comments: 59

12
Hugging Face Daily Papers research 2d ago

DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

Abstract: DreamForge-World 0.1 Preview adapts a video generation architecture with a residual action pathway to enable real-time interactive world simulation on consumer hardware with low computational requirements. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We present…

18
r/LocalLLaMA community 2d ago

PageStorm: A Model Built for Creative Book Writing

Over a year ago, we set out to build a single-turn full-book writing model. Half a year ago, we published our LongPage Dataset for book scale creative writing. Today, we are announcing our first model: PageStorm Research Preview. Paper: https://arxiv.org/abs/2605.17064 Models:…

9
Hugging Face Daily Papers research 2d ago

TheoremGraph: Bridging Formal and Informal Mathematics

Abstract: A unified mathematical dependency graph connects informal and formal mathematics through semantic embedding and automated extraction from arXiv papers and Lean projects. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Mathematical knowledge is organized around statements…

32
r/LocalLLaMA community 2d ago

InternScience/Agents-A1 · Hugging Face

Unbelievable benchmarks for a 35B MoE, somebody verify. Here is tech report btw: https://arxiv.org/pdf/2606.30616 (submitted by /u/mlon_eusk-_-)

23
arXiv — Machine Learning research 2d ago

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing…

36
arXiv — Machine Learning research 2d ago

On the Necessity of a Liquid Substrate for Mesh Intelligence

arXiv:2606.28413v1 Announce Type: new Abstract: A mesh of sovereign agents has no center: no shared clock, no shared model, and no coordinator to gather data or retrain. Its competence rests on each agent folding the projections its peers emit into a single internal state,…

8
arXiv — Machine Learning research 2d ago

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

arXiv:2606.28433v1 Announce Type: new Abstract: One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings. When running experiments, however, the goal of…

5
arXiv — Machine Learning research 2d ago

Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter

arXiv:2606.28441v1 Announce Type: new Abstract: Online latent state estimation constitutes a fundamental challenge within the artificial intelligence field, serving as a foundational tool for diverse applications, including sequential decision making, anomaly and change-point…

21
arXiv — Machine Learning research 2d ago

S-GAI: Spectral Geometry-Aware Initialization for Sigmoidal MLPs -- From Dataset Geometry to Network Weights

arXiv:2606.28444v1 Announce Type: new Abstract: Classical universal approximation theorems establish the expressive power of sigmoidal multilayer perceptrons, but they do not prescribe how initial weights should encode the geometry of a data distribution. We propose S-GAI, a…

31
arXiv — Machine Learning research 2d ago

scKDGM: KAN-guided Dynamic Graph Masked Learning for Single-Cell RNA-seq Clustering

arXiv:2606.28459v1 Announce Type: new Abstract: Single-cell RNA sequencing (scRNA-seq) clustering is essential for identifying cell types, but high dimensionality, sparsity, dropout, and technical noise hinder robust expression representation and cell graph construction.…

27
arXiv — Machine Learning research 2d ago

Counterfactual Residual Data Augmentation for Regression

arXiv:2606.28460v1 Announce Type: new Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel…

21
arXiv — Machine Learning research 2d ago

Singular Learning and Occam's Razor in Deep Monomial Networks

arXiv:2606.28464v1 Announce Type: new Abstract: In the optimization of neural networks, gradient dynamics are influenced by critical points that arise from the model's architecture. These critical points occur where the Jacobian of the model's parametrization is rank-deficient,…

11
arXiv — Machine Learning research 2d ago

An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations

arXiv:2606.28467v1 Announce Type: new Abstract: Appliance-level energy monitoring in office buildings produces noisy alerts that non-expert facility managers struggle to use. This paper proposes an end-to-end agentic pipeline that combines deep time-series forecasting,…

11
arXiv — Machine Learning research 2d ago

Modelling Emotional Memory in Children with Tensor Networks

arXiv:2606.28470v1 Announce Type: new Abstract: We demonstrate how emotional valence influences the order-dependent structure of children's recognition memory: correct recall of a sequence of emotionally-valenced toys depended not just on the valence of a given toy itself, but…

7
arXiv — Machine Learning research 2d ago

A Trainable-by-Parts Operator Learning Framework: Bridging DeepONet and Karhunen-Loeve Expansions for Large-Scale Applications

arXiv:2606.28519v1 Announce Type: new Abstract: Training operator-learning models for large-scale problems governed by partial differential equations (PDEs) is challenging due to the curse of dimensionality, memory constraints, and limited training data. These challenges arise…

38
arXiv — Machine Learning research 2d ago

A Gravitational Interpretation of Fine-Tuning Reversion

arXiv:2606.28525v1 Announce Type: new Abstract: Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently…

27
arXiv — Machine Learning research 2d ago

NIVA: A Multimodal Foundation Model for Actionable Earth System Intelligence

arXiv:2606.28546v1 Announce Type: new Abstract: Recent advances in AI-driven weather and climate modeling have improved forecast skill while reducing computational cost. However, existing data-driven approaches are limited in their ability to model coupled Earth system dynamics,…

9
arXiv — Machine Learning research 2d ago

Improving Coherence in Hierarchical Time Series Forecasting using Structured Temporal Fusion

arXiv:2606.28553v1 Announce Type: new Abstract: In many real-world applications, such as retail sales, energy usage, and supply chain planning, forecasting is performed across hierarchical structures. These structures often represent aggregations (e.g., products to categories to…

29
arXiv — Machine Learning research 2d ago

Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings

arXiv:2606.28572v1 Announce Type: new Abstract: The axiom of choice has divided the foundations of mathematics for over a century, but the distinction between classical and constructive proofs has remained a philosophical and methodological one. We use Lean 4's kernel-level…

8
arXiv — Machine Learning research 2d ago

Replica Symmetry Breaking and Algorithmic Thresholds in Empirical Risk Minimization under Multi-Index Model

arXiv:2606.28573v1 Announce Type: new Abstract: Modern machine learning models are trained by optimizing high-dimensional non-convex empirical risk functions. Such cost functions can have a multitude of local optima and yet, gradient-based optimization appears to converge to…

10
arXiv — Machine Learning research 2d ago

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these…

31
arXiv — Machine Learning research 2d ago

Randomized Exploration for Linear Bandits via Absolute Perturbations

arXiv:2606.28616v1 Announce Type: new Abstract: In stochastic linear bandits, the canonical Upper Confidence Bound (UCB) algorithm admits a simple frequentist regret analysis but can be computationally demanding, while Thompson Sampling (TS) is computationally attractive yet…

27
