Tag

Research papers

500 articles archived under #paper · RSS

arXiv — NLP / Computation & Language research 8h ago

Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

arXiv:2607.01047v1 Announce Type: new Abstract: Complexity and interpretability rarely coincide: systems rich enough for complex behaviours to emerge are usually too opaque to question, while transparent ones are too simple for anything complex to emerge. A single large language…

33
arXiv — NLP / Computation & Language research 8h ago

Message Passing Enables Efficient Reasoning

arXiv:2607.01077v1 Announce Type: new Abstract: While inference-time scaling has improved the reasoning abilities of large language models (LLMs), the need to generate long chains-of-thought (CoTs) is a computational bottleneck. Thus, in contrast to sequential scaling methods…

37
arXiv — NLP / Computation & Language research 8h ago

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration…

12
arXiv — NLP / Computation & Language research 8h ago

Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach

arXiv:2607.01115v1 Announce Type: new Abstract: University stakeholders often face difficulties in accessing timely and reliable information, especially in developing countries, where there are very few intelligent support systems. Existing rule-based chatbots are unable to…

15
arXiv — NLP / Computation & Language research 8h ago

$\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space

arXiv:2607.01127v1 Announce Type: new Abstract: Quantization has become an invaluable tool to reduce memory requirements and inference speed of modern language models, in particular to make them available for consumer setups and edge devices. While previous work has primarily…

17
arXiv — NLP / Computation & Language research 8h ago

AGC-Bench: Measuring Artificial General Creativity

arXiv:2607.01152v1 Announce Type: new Abstract: Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of…

10
arXiv — NLP / Computation & Language research 8h ago

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded…

14
arXiv — NLP / Computation & Language research 8h ago

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

arXiv:2607.01208v1 Announce Type: new Abstract: Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and…

15
arXiv — NLP / Computation & Language research 8h ago

The State-Prediction Separation Hypothesis

arXiv:2607.01218v1 Announce Type: new Abstract: Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles…

32
arXiv — NLP / Computation & Language research 8h ago

Measuring the Gap Between Human and LLM Research Ideas

arXiv:2607.01233v1 Announce Type: new Abstract: LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human…

8
arXiv — NLP / Computation & Language research 8h ago

Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions

arXiv:2507.15692v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect…

22
arXiv — NLP / Computation & Language research 8h ago

Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework

arXiv:2607.00010v1 Announce Type: cross Abstract: Conversational recommender systems (CRSs) are a core component of next-generation intelligent recommender systems because they enable users to actively elicit preferences, clarify intentions, and adapt recommendations in real…

4
arXiv — NLP / Computation & Language research 8h ago

Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory

arXiv:2607.00017v1 Announce Type: cross Abstract: Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user. Existing memory-augmented LLM agents have made progress in building…

30
arXiv — NLP / Computation & Language research 8h ago

Destination-Labeled Self-Looping Systems with Dwell: Intrinsic Characterization, Realization Cost, and Recognition

arXiv:2607.00044v1 Announce Type: cross Abstract: We study a finite-state symbolic controller for systems in which the admissible visible transitions are fixed in advance and each visible state carries a minimum dwell requirement. The resulting model, which we call a…

32
arXiv — NLP / Computation & Language research 8h ago

From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents

arXiv:2607.00233v1 Announce Type: cross Abstract: How do two agents invent a shared language from scratch? In a Lewis signaling game, a sender and receiver must coordinate on a code using only their interaction history. We study five memory architectures across varying channel…

26
arXiv — NLP / Computation & Language research 8h ago

An LLM-Based Framework for Intent-Driven Network Topology Design

arXiv:2607.00292v1 Announce Type: cross Abstract: Designing deployable and resilient network topologies from natural language requirements remains a challenging problem in network automation. This work investigates the ability of Large Language Models (LLMs) to generate…

35
arXiv — NLP / Computation & Language research 8h ago

Rosetta: Composable Native Multimodal Pretraining

arXiv:2607.00293v1 Announce Type: cross Abstract: Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete…

15
arXiv — NLP / Computation & Language research 8h ago

A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

arXiv:2607.00309v1 Announce Type: cross Abstract: We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct…

19
arXiv — NLP / Computation & Language research 8h ago

Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

arXiv:2607.00374v1 Announce Type: cross Abstract: Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks…

5
arXiv — NLP / Computation & Language research 8h ago

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

arXiv:2607.00394v1 Announce Type: cross Abstract: LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement…

22
arXiv — NLP / Computation & Language research 8h ago

NeuroCogMap Reveals Cognitive Organization of Large Language Models

arXiv:2607.00397v1 Announce Type: cross Abstract: Understanding how complex cognitive functions are organized within artificial systems is central to interpreting large language models (LLMs) and relating them to biological cognition. Yet although LLMs exhibit broad…

24
arXiv — NLP / Computation & Language research 8h ago

StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

arXiv:2607.00465v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same…

8
arXiv — NLP / Computation & Language research 8h ago

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed.…

21
arXiv — NLP / Computation & Language research 8h ago

Self-Evolving Agents with Anytime-Valid Certificates

arXiv:2607.00871v1 Announce Type: cross Abstract: Self-evolving agents violate the assumption behind most learning-theoretic guarantees: the data, evaluator, components, and hypothesis space are produced by the policy being updated. We present SEA, an architecture that…

27
arXiv — NLP / Computation & Language research 8h ago

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

arXiv:2607.00924v1 Announce Type: cross Abstract: Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable…

36
arXiv — NLP / Computation & Language research 8h ago

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

arXiv:2607.01061v1 Announce Type: cross Abstract: Computer-assisted synthesis planning breaks target molecules into accessible precursors using large libraries of reaction rules that assign each transformation a deterministic, interpretable label. But chemistry is long-tailed,…

17
arXiv — NLP / Computation & Language research 8h ago

Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages

arXiv:2607.01161v1 Announce Type: cross Abstract: Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. However, standard evaluation protocols confound language mismatch…

16
arXiv — NLP / Computation & Language research 8h ago

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

arXiv:2607.01223v1 Announce Type: cross Abstract: When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the…

18
arXiv — NLP / Computation & Language research 8h ago

AutoMem: Automated Learning of Memory as a Cognitive Skill

arXiv:2607.01224v1 Announce Type: cross Abstract: Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as…

35
arXiv — NLP / Computation & Language research 8h ago

Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

arXiv:2503.13445v3 Announce Type: replace Abstract: When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work,…

4
arXiv — NLP / Computation & Language research 8h ago

GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

arXiv:2507.05740v2 Announce Type: replace Abstract: Language models are powerful artifacts, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely…

30
arXiv — NLP / Computation & Language research 8h ago

Toward Cybersecurity-Expert Small Language Models

arXiv:2510.14113v2 Announce Type: replace Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal…

30
arXiv — NLP / Computation & Language research 8h ago

LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

arXiv:2510.24434v3 Announce Type: replace Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning…

10
arXiv — NLP / Computation & Language research 8h ago

OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

arXiv:2510.24636v3 Announce Type: replace Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and…

34
arXiv — NLP / Computation & Language research 8h ago

Reasoning Up the Instruction Ladder for Controllable Language Models

arXiv:2511.04694v5 Announce Type: replace Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources within a single prompt context. Enforcing an instruction…

17
arXiv — NLP / Computation & Language research 8h ago

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,…

22
arXiv — NLP / Computation & Language research 8h ago

Graded strength of comparative illusions is explained by Bayesian inference

arXiv:2511.14642v2 Announce Type: replace Abstract: Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case--the comparative illusion (CI), e.g., More students have been to Russia than I…

33
r/MachineLearning community 1d ago

On July 1, 2026, arXiv will spin out from Cornell University, its home for the past 25 years, to become an independent nonprofit organization. Major funding support from Simons Foundation and Schmidt Sciences. Ditching the red for their website. [N]

arXiv’s next chapter: Updates on our spin out from Cornell University: https://blog.arxiv.org/2026/06/30/arxivs-next-chapter/ (submitted by /u/Nunki08)

12
arXiv — Machine Learning research 1d ago

Joint discovery of governing partial differential equations from multi-source datasets by competitive optimization

arXiv:2606.30699v1 Announce Type: new Abstract: Discovering governing equations directly from observational data is a key step towards interpretable scientific machine learning. Current data-driven approaches typically operate on a single dataset, inherently limiting their…

38
arXiv — Machine Learning research 1d ago

Accelerometry-Derived Digital Biomarkers for Cardiometabolic Risk: A Population-Representative Tabular Benchmark with Uncertainty Quantification

arXiv:2606.30702v1 Announce Type: new Abstract: Structured tabular data dominates clinical medicine, yet existing benchmarks fail to reflect real-world properties like complex survey sampling, demographic oversampling, and subgroup fairness. We introduce the NHANES Accelerometry…

31
arXiv — NLP / Computation & Language research 1d ago

From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators

arXiv:2606.30704v1 Announce Type: cross Abstract: Large language models (LLMs) excel across a wide range of tasks, yet their instance-specific solutions often lack the structural consistency needed for reliable deployment. Workflows that encode recurring algorithmic patterns at…

13
arXiv — Machine Learning research 1d ago

Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts

arXiv:2606.30705v1 Announce Type: new Abstract: Deterministic few-step generation succeeds on continuous image latents but collapses to incoherent text on continuous text latents, and we show the cause is geometric rather than a training or scaling deficiency: a smooth,…

30
arXiv — Machine Learning research 1d ago

Hierarchical Global Attention (HGA)

arXiv:2606.30709v1 Announce Type: new Abstract: Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA preserves the original checkpoint parameters: the pretrained $W_Q$, $W_K$, $W_V$, and $W_O$…

23
arXiv — Machine Learning research 1d ago

ReactionAtlas: Ab origine exploration of chemical reaction networks with machine learning

arXiv:2606.30778v1 Announce Type: new Abstract: Mapping a chemical reaction network, the graph of minima and transition states (TS) and the elementary reactions connecting them, is the natural language of chemistry, from catalysis to combustion to the origin of life.…

5
arXiv — NLP / Computation & Language research 1d ago

Revocable Learned State via Process Sidecars

arXiv:2606.30788v1 Announce Type: cross Abstract: Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not…

17
arXiv — Machine Learning research 1d ago

Predictable GRPO: A Closed-Form Model of Training Dynamics

arXiv:2606.30789v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a standard tool for improving the reasoning ability of large language models, yet its training dynamics are still described empirically: reward trajectories are fit with…

16
arXiv — Machine Learning research 1d ago

Gradient Smoothing: Coupling Layer-wise Updates for Improved Optimization

arXiv:2606.30813v1 Announce Type: new Abstract: Deep neural networks with repeated architectural blocks, such as transformers, often exhibit structured relationships across layers that emerge during training. Motivated by this observation, we introduce Depth-wise Gradient…

25
arXiv — Machine Learning research 1d ago

Mind the Residual Gap: Probabilistic Downscaling under Real-World Bias

arXiv:2606.30821v1 Announce Type: new Abstract: Probabilistic downscaling is the task of modeling the conditional distribution of high-resolution fields given coarse inputs, and is a central challenge to atmospheric science, climate modeling, and other multiscale physical…

21
arXiv — Machine Learning research 1d ago

Partition-Guided Distance Saliency: Bridging Decision and Objective Spaces in Many-Objective Optimization

arXiv:2606.30836v1 Announce Type: new Abstract: Explainability in Many-Objective Optimization (MaO) is currently hindered by the escalating complexity of the Pareto front, which renders the relationship between high-dimensional decision variables and objective outcomes…

16
arXiv — Machine Learning research 1d ago

A Stationary-Distribution Theory for Triplet-Based Plateau Search in Random Forest Ensemble-Size Selection

arXiv:2606.30837v1 Announce Type: new Abstract: The number of trees is a central computational parameter in Random Forests: increasing it reduces finite-ensemble variability but increases training and prediction cost. Plateau-based tuning adapts this parameter through local…

18
