arXiv — NLP / Computation & Language

arXiv — NLP / Computation & Language research 6h ago

Evidence-Supported Credit Risk Report Generation Using News-Centric Financial Knowledge Graphs

arXiv:2607.01023v1 Announce Type: new Abstract: Financial markets evolve in response to real-world events reported in news, yet these drivers often remain implicit in text. To better explain market dynamics, event-market relations must be explicitly modeled through factual,…

Behavior-Adaptive Conversational Agents: Toward a Fluid Personality Framework

arXiv:2607.01034v1 Announce Type: new Abstract: Large language model (LLM)-based conversational agents (CAs) are now ubiquitous, creating new opportunities for AI-mediated behavior change. Their capacity to project nuanced personalities and adopt diverse metaphorical roles…

Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

arXiv:2607.01047v1 Announce Type: new Abstract: Complexity and interpretability rarely coincide: systems rich enough for complex behaviours to emerge are usually too opaque to question, while transparent ones are too simple for anything complex to emerge. A single large language…

Message Passing Enables Efficient Reasoning

arXiv:2607.01077v1 Announce Type: new Abstract: While inference-time scaling has improved the reasoning abilities of large language models (LLMs), the need to generate long chains-of-thought (CoTs) is a computational bottleneck. Thus, in contrast to sequential scaling methods…

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration…

Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach

arXiv:2607.01115v1 Announce Type: new Abstract: University stakeholders often face difficulties in accessing timely and reliable information, especially in developing countries, where there are very few intelligent support systems. Existing rule-based chatbots are unable to…

$\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space

arXiv:2607.01127v1 Announce Type: new Abstract: Quantization has become an invaluable tool to reduce memory requirements and improve inference speed of modern language models, in particular to make them available for consumer setups and edge devices. While previous work has primarily…

AGC-Bench: Measuring Artificial General Creativity

arXiv:2607.01152v1 Announce Type: new Abstract: Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of…

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded…

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

arXiv:2607.01208v1 Announce Type: new Abstract: Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and…

The State-Prediction Separation Hypothesis

arXiv:2607.01218v1 Announce Type: new Abstract: Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles…

Measuring the Gap Between Human and LLM Research Ideas

arXiv:2607.01233v1 Announce Type: new Abstract: LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human…

Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions

arXiv:2507.15692v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect…

Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework

arXiv:2607.00010v1 Announce Type: cross Abstract: Conversational recommender systems (CRSs) are a core component of next-generation intelligent recommender systems because they enable users to actively elicit preferences, clarify intentions, and adapt recommendations in real…

Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory

arXiv:2607.00017v1 Announce Type: cross Abstract: Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user. Existing memory-augmented LLM agents have made progress in building…

Destination-Labeled Self-Looping Systems with Dwell: Intrinsic Characterization, Realization Cost, and Recognition

arXiv:2607.00044v1 Announce Type: cross Abstract: We study a finite-state symbolic controller for systems in which the admissible visible transitions are fixed in advance and each visible state carries a minimum dwell requirement. The resulting model, which we call a…

CogTax: A Four-Level Cognitive Taxonomy for Command-Line Computing Education

arXiv:2607.00140v1 Announce Type: cross Abstract: As computing education expands beyond traditional programming into operational domains such as systems administration and command-line environments, existing pedagogical frameworks struggle to capture a dimension that is critical…

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

arXiv:2607.00152v1 Announce Type: cross Abstract: Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: standard deviation, reflecting how much a prompt's sampled answers…

From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents

arXiv:2607.00233v1 Announce Type: cross Abstract: How do two agents invent a shared language from scratch? In a Lewis signaling game, a sender and receiver must coordinate on a code using only their interaction history. We study five memory architectures across varying channel…

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

arXiv:2607.00276v1 Announce Type: cross Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning…

An LLM-Based Framework for Intent-Driven Network Topology Design

arXiv:2607.00292v1 Announce Type: cross Abstract: Designing deployable and resilient network topologies from natural language requirements remains a challenging problem in network automation. This work investigates the ability of Large Language Models (LLMs) to generate…

Rosetta: Composable Native Multimodal Pretraining

arXiv:2607.00293v1 Announce Type: cross Abstract: Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete…

EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems

arXiv:2607.00297v1 Announce Type: cross Abstract: When LLM agents use evaluator feedback to adapt their behavior in closed loops, evaluator biases propagate through the agent's strategy distribution -- a phenomenon known as evaluator preference coupling. Prior work has…

Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions

arXiv:2607.00304v1 Announce Type: cross Abstract: The bias-reliability tradeoff conjectures that LLM evaluation systems are constrained in (gamma, H, CV) space, where evaluator coupling (gamma), strategy diversity (H), and small-sample measurement reliability (CV(N)) cannot be…

A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

arXiv:2607.00309v1 Announce Type: cross Abstract: We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct…

Watermarking for Proprietary Dataset Protection

arXiv:2607.00325v1 Announce Type: cross Abstract: A growing body of literature suggests that training data membership inference problems are fundamentally hard tasks in modern language modeling settings. We argue that output watermarking techniques are the right gadget to make…

Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

arXiv:2607.00374v1 Announce Type: cross Abstract: Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks…

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

arXiv:2607.00394v1 Announce Type: cross Abstract: LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement…

NeuroCogMap Reveals Cognitive Organization of Large Language Models

arXiv:2607.00397v1 Announce Type: cross Abstract: Understanding how complex cognitive functions are organized within artificial systems is central to interpreting large language models (LLMs) and relating them to biological cognition. Yet although LLMs exhibit broad…

MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules

arXiv:2607.00464v1 Announce Type: cross Abstract: Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many…

StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

arXiv:2607.00465v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same…

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed.…

Self-Evolving Agents with Anytime-Valid Certificates

arXiv:2607.00871v1 Announce Type: cross Abstract: Self-evolving agents violate the assumption behind most learning-theoretic guarantees: the data, evaluator, components, and hypothesis space are produced by the policy being updated. We present SEA, an architecture that…

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

arXiv:2607.00924v1 Announce Type: cross Abstract: Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable…

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

arXiv:2607.01061v1 Announce Type: cross Abstract: Computer-assisted synthesis planning breaks target molecules into accessible precursors using large libraries of reaction rules that assign each transformation a deterministic, interpretable label. But chemistry is long-tailed,…

CausalMix: Data Mixture as Causal Inference for Language Model Training

arXiv:2607.01104v1 Announce Type: cross Abstract: In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As…

Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages

arXiv:2607.01161v1 Announce Type: cross Abstract: Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. However, standard evaluation protocols confound language mismatch…

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

arXiv:2607.01179v1 Announce Type: cross Abstract: Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference…

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

arXiv:2607.01181v1 Announce Type: cross Abstract: RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only…

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

arXiv:2607.01223v1 Announce Type: cross Abstract: When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the…

AutoMem: Automated Learning of Memory as a Cognitive Skill

arXiv:2607.01224v1 Announce Type: cross Abstract: Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as…

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

arXiv:2607.01232v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically…

Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

arXiv:2503.13445v3 Announce Type: replace Abstract: When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work,…

GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

arXiv:2507.05740v2 Announce Type: replace Abstract: Language models are powerful artifacts, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely…

Toward Cybersecurity-Expert Small Language Models

arXiv:2510.14113v2 Announce Type: replace Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal…

LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

arXiv:2510.24434v3 Announce Type: replace Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning…

OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

arXiv:2510.24636v3 Announce Type: replace Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and…

Reasoning Up the Instruction Ladder for Controllable Language Models

arXiv:2511.04694v5 Announce Type: replace Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources within a single prompt context. Enforcing an instruction…

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,…

Graded strength of comparative illusions is explained by Bayesian inference

arXiv:2511.14642v2 Announce Type: replace Abstract: Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case--the comparative illusion (CI), e.g., More students have been to Russia than I…
