arXiv — NLP / Computation & Language


arXiv — NLP / Computation & Language research 1d ago

LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish

arXiv:2606.31947v1 Announce Type: new Abstract: State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we…


DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

arXiv:2606.31980v1 Announce Type: new Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions…


Scalable Behaviour Cloning on Browser Using via Skill Distillation

arXiv:2606.32014v1 Announce Type: new Abstract: Internet users collectively perform an enormous range of skilled work through web browsers, from software development and document editing to search, forms, and enterprise workflows, making human browsing a highly scalable but…


Generative Skill Composition for LLM Agents

arXiv:2606.32025v1 Announce Type: new Abstract: Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a…


When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

arXiv:2606.32029v1 Announce Type: new Abstract: While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer…


Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

arXiv:2606.32032v1 Announce Type: new Abstract: Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with…


Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

arXiv:2606.32038v1 Announce Type: new Abstract: When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their…


ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection

arXiv:2606.30646v1 Announce Type: cross Abstract: Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia…


Emergent Culture in Minimal LLM Systems

arXiv:2606.30668v1 Announce Type: cross Abstract: What happens when LLM agents operate with no context outside a turn, minimal prompting, and simple tools? Inspired by swarm engineering, we give collectives of three agents the ability to send messages and manipulate a shared…


ViTL: Temporal Logic-Guided Zero-Shot Natural Language Navigation via Vision-Language Models

arXiv:2606.30696v1 Announce Type: cross Abstract: Enabling robots to follow natural language commands to complete zero-shot long-horizon tasks remains challenging. It requires extracting implicit temporal and logical constraints from natural language commands and executing…


From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators

arXiv:2606.30704v1 Announce Type: cross Abstract: Large language models (LLMs) excel across a wide range of tasks, yet their instance-specific solutions often lack the structural consistency needed for reliable deployment. Workflows that encode recurring algorithmic patterns at…


Revocable Learned State via Process Sidecars

arXiv:2606.30788v1 Announce Type: cross Abstract: Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not…


Information Terra: A Narrative-Anchored Semantic-First Projection of Document Embeddings

arXiv:2606.30824v1 Announce Type: cross Abstract: We introduce Information Terra, a narrative-anchored semantic-first projection that places a document corpus on an Earth-like globe whose poles are two user-chosen endpoint documents and whose prime meridian is the great-circle…


When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

arXiv:2606.30852v1 Announce Type: cross Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with…


Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

arXiv:2606.31002v1 Announce Type: cross Abstract: Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean…


ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive…


Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021

arXiv:2606.31081v1 Announce Type: cross Abstract: The present study analyzed over 26,000 research articles published between 1991 and 2021 in twenty-one major LIS (Library and Information Science) journals, using the machine learning (ML) approach to categorize the research…


UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

arXiv:2606.31128v1 Announce Type: cross Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion…


PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding

arXiv:2606.31148v1 Announce Type: cross Abstract: 3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions. Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high…


ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries

arXiv:2606.31163v1 Announce Type: cross Abstract: Large language models deployed in regulated industries operate under two constraints: compliance enforcement and cost efficiency. Personally identifiable information (PII) in user queries can reach model endpoints before the…


HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite…


Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

arXiv:2606.31270v1 Announce Type: cross Abstract: Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these…


The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills

arXiv:2606.31272v1 Announce Type: cross Abstract: AI agents increasingly acquire and execute skills at runtime: bundles of prompt instructions, executable code, and tool declarations fetched from marketplaces and other agents. Governing them needs a stable notion of skill…


Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?

arXiv:2606.31371v1 Announce Type: cross Abstract: When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribution - a phenomenon termed evaluator preference coupling.…


Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

arXiv:2606.31407v1 Announce Type: cross Abstract: Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows…


CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

arXiv:2606.31435v1 Announce Type: cross Abstract: Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or…


Fork-Think with Confidence

arXiv:2606.31484v1 Announce Type: cross Abstract: Parallel thinking has enjoyed great success for boosting LLM performance on reasoning tasks without the need for any re-training. However, existing methods follow a think-first-then-decide paradigm, i.e., they first sample…


Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

arXiv:2606.31511v1 Announce Type: cross Abstract: In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a…


RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference

arXiv:2606.31519v1 Announce Type: cross Abstract: Long-context Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static fixed-budget (Top-k) retrieval or rely on proxy scores that…


Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

arXiv:2606.31543v1 Announce Type: cross Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I…


ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

arXiv:2606.31693v1 Announce Type: cross Abstract: The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation…


RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

arXiv:2606.31694v1 Announce Type: cross Abstract: For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from…


Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

arXiv:2606.31779v1 Announce Type: cross Abstract: Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model's hidden states, replacing…


SpikeLogBERT: Energy-Efficient Log Parsing Using Spiking Transformer Networks

arXiv:2606.31781v1 Announce Type: cross Abstract: Log parsing is a fundamental step in automated log analysis, transforming raw system logs into structured event templates for downstream tasks such as anomaly detection and system monitoring. Existing log parsing methods range…


Review Residuals: Update-Conditioned Residual Gating for Transformers

arXiv:2606.31859v1 Announce Type: cross Abstract: Residual connections add every sublayer's proposed update with a fixed coefficient of one; the network never evaluates whether an update is reliable before committing it. Drawing on the human-factors principle of independent…


Signed-Permutation Coordinate Transport for RMSNorm Transformers

arXiv:2606.31963v1 Announce Type: cross Abstract: Modern LLM workflows move coordinate-indexed objects across checkpoints: steering vectors, sparse autoencoders, top-$k$ neuron sets, attribution lists, and merge alignments. This is only well posed after fixing the model's…


MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

arXiv:2606.31966v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a…


SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

arXiv:2606.32022v1 Announce Type: cross Abstract: Residual-stream analysis asks how language-model computation evolves across depth, but intermediate decoding requires comparable readout coordinates across layers. If embedding anchors and unembedding readout disagree on the…


QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

arXiv:2606.32034v1 Announce Type: cross Abstract: LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the…


Learning by Surprise: Adaptive Mitigation of Model Collapse in Large Language Models

arXiv:2410.12341v4 Announce Type: replace Abstract: As AI-generated content increasingly populates the web, generative AI models are at growing risk of being trained on their own outputs, a process known as AI autophagy. This feedback loop has been shown to induce model…


Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection

arXiv:2502.15845v2 Announce Type: replace Abstract: Large Language Models (LLMs) often hallucinate, limiting their reliability in sensitive applications. In black-box settings, several self-consistency-based techniques have been proposed for hallucination detection. We…


SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA

arXiv:2504.07385v3 Announce Type: replace Abstract: As Large Language Models (LLMs) become increasingly used for question-answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. Meanwhile,…


From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

arXiv:2506.17294v3 Announce Type: replace Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding research area, offering advantages such as scalable availability and personalized narration. However, existing…


The Bidirectional Process Reward Model

arXiv:2508.01682v3 Announce Type: replace Abstract: Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs).…


Rethinking On-policy Optimization for Query Augmentation

arXiv:2510.17139v3 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or…


Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

arXiv:2512.21002v3 Announce Type: replace Abstract: Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt…


InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

arXiv:2601.04126v3 Announce Type: replace Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present…


What If We Allocate Test-Time Compute Adaptively?

arXiv:2602.01070v5 Announce Type: replace Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning…


FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

arXiv:2602.06625v2 Announce Type: replace Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and…


Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

arXiv:2603.19453v3 Announce Type: replace Abstract: We propose an LLM harness that generates code-based policy functions for multi-agent environments, evaluates them with self-play, and refines them using feedback from previous iterations. Following the recent line of work in…

