arXiv — NLP / Computation & Language


arXiv — NLP / Computation & Language research 1d ago

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

arXiv:2606.30775v1 Announce Type: new Abstract: Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term…

25
arXiv — NLP / Computation & Language research 1d ago

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

arXiv:2606.30790v1 Announce Type: new Abstract: Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs)…

26
arXiv — NLP / Computation & Language research 1d ago

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

arXiv:2606.30801v1 Announce Type: new Abstract: Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box access to the algorithms, while personalization depends on…

37
arXiv — NLP / Computation & Language research 1d ago

When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

arXiv:2606.30814v1 Announce Type: new Abstract: Calibration evaluates whether a model's confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration…

21
arXiv — NLP / Computation & Language research 1d ago

When transformers learn "impossible" languages, what do they learn?

arXiv:2606.30815v1 Announce Type: new Abstract: Recent work suggests that transformer language models show a bias towards human languages over unnatural ("impossible") languages argued to be unacquirable by humans. However, this literature has largely based these claims on…

34
arXiv — NLP / Computation & Language research 1d ago

Test-Time Verification for Text-to-SQL via Outcome Reward Models

arXiv:2606.30851v1 Announce Type: new Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority…

15
arXiv — NLP / Computation & Language research 1d ago

Multilingual Polarization Detection Using Transformer-Based Models with Class Weighting and Threshold Tuning

arXiv:2606.30857v1 Announce Type: new Abstract: This paper describes our submission to SemEval-2026 Task 9 on detecting multilingual, multicultural, and multievent online polarization. We address all three subtasks: binary polarization detection, polarization type…

4
arXiv — NLP / Computation & Language research 1d ago

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

arXiv:2606.30887v1 Announce Type: new Abstract: Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. We introduce a framework that formulates…

32
arXiv — NLP / Computation & Language research 1d ago

Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text

arXiv:2606.30914v1 Announce Type: new Abstract: Event detection (ED) systems are typically evaluated on clean, curated text, leaving their robustness to real-world noise largely unexplored, particularly for low-resource languages such as Bangla. We introduce a generalized Bangla…

17
arXiv — NLP / Computation & Language research 1d ago

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

arXiv:2606.30943v1 Announce Type: new Abstract: Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which affects international collaboration and the progress of…

8
arXiv — NLP / Computation & Language research 1d ago

Linguistic Distancing on Social Media: Indicators of Emotion Regulation Across Age Groups

arXiv:2606.30957v1 Announce Type: new Abstract: Managing our emotional responses to events is key to emotional well-being, a process referred to as emotion regulation in psychology. Previous work has established that the degree to which we distance events is a type of emotion…

8
arXiv — NLP / Computation & Language research 1d ago

From Propositional to Perceptual Asymmetry: Extending Frictive Policy Optimization to Asymmetric Partial Information Dialogue

arXiv:2606.30973v1 Announce Type: new Abstract: Frictive Policy Optimization (FPO; Pustejovsky et al., 2025) treats friction in collaborative dialogue -- misalignment, misunderstanding, repair -- as an epistemic signal essential to common-ground construction, rather than noise…

18
arXiv — NLP / Computation & Language research 1d ago

Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

arXiv:2606.30987v1 Announce Type: new Abstract: Decision-makers routinely rely on expert judgments accompanied by written explanations, yet explanation quality is difficult to measure at scale. Forecasting tournaments offer a natural testing ground: probabilistic judgments are…

6
arXiv — NLP / Computation & Language research 1d ago

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

arXiv:2606.30989v1 Announce Type: new Abstract: Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive…

5
arXiv — NLP / Computation & Language research 1d ago

CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations

arXiv:2606.31033v1 Announce Type: new Abstract: In this paper, we propose CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation (RAG). In long-form RAG outputs, hallucinations often arise in localized spans rather than throughout an entire…

20
arXiv — NLP / Computation & Language research 1d ago

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

arXiv:2606.31039v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such as logical fallacies remains underexplored. Prior work has primarily examined whether LLMs can…

12
arXiv — NLP / Computation & Language research 1d ago

A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases

arXiv:2606.31041v1 Announce Type: new Abstract: Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names,…

12
arXiv — NLP / Computation & Language research 1d ago

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with…

7
arXiv — NLP / Computation & Language research 1d ago

Exploring the relationship between team institutional composition and novelty in academic papers based on fine-grained knowledge entities

arXiv:2606.31058v1 Announce Type: new Abstract: The composition of author teams is an important factor influencing the novelty of academic papers. However, existing studies have paid limited attention to the role of institutional composition, and most novelty measures remain at…

22
arXiv — NLP / Computation & Language research 1d ago

Building a Multimodal Dataset of Academic Paper for Keyword Extraction

arXiv:2606.31069v1 Announce Type: new Abstract: Up to this point, the keyword extraction task has typically relied solely on textual data. Neglecting visual details and audio features from image and audio modalities leads to deficiencies in information richness and overlooks potential…

14
arXiv — NLP / Computation & Language research 1d ago

Triospect: A Three-Dimensional Framework for Robust Statistical AI-Generated Text Detection Against Diverse Attacks

arXiv:2606.31074v1 Announce Type: new Abstract: Existing AI-generated text detectors are vulnerable to attacks that manipulate textual characteristics. In this study, we propose a novel Triospect Detection Framework by using additional perspectives of content (core ideas) and…

37
arXiv — NLP / Computation & Language research 1d ago

When Reranking Hurts: Uncertainty-Based Gating for Few-Shot Reranking

arXiv:2606.31087v1 Announce Type: new Abstract: Few-shot selection typically assumes that reranking retrieved examples always improves performance. We challenge this view by identifying that the expensive reranking step can in fact degrade performance. Instead, we propose…

4
arXiv — NLP / Computation & Language research 1d ago

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

arXiv:2606.31112v1 Announce Type: new Abstract: ASR systems have often been reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including…

31
arXiv — NLP / Computation & Language research 1d ago

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

arXiv:2606.31145v1 Announce Type: new Abstract: Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching…

11
arXiv — NLP / Computation & Language research 1d ago

TAG-DLM: Diffusion Language Models for Text-Attributed Graph Learning

arXiv:2606.31166v1 Announce Type: new Abstract: Text-attributed graphs (TAGs), where each node carries a natural language description, require models to jointly reason over text and graph topology. Existing approaches often handle the two modalities separately: graph neural…

8
arXiv — NLP / Computation & Language research 1d ago

Gated Multi-Graph Fusion via Graph Attention Networks for Alzheimer's Disease Detection

arXiv:2606.31186v1 Announce Type: new Abstract: Spontaneous speech is a vital non-invasive biomarker for Alzheimer's Disease (AD), yet many systems overlook non-linear structural disruptions and clinical heterogeneity in pathological language. We propose a Multi-View Gated Graph…

31
arXiv — NLP / Computation & Language research 1d ago

Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas?

arXiv:2606.31213v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed as moral advisors and agents, they need to address dilemmas between two competing values. However, existing research on LLMs with moral dilemmas overlooks a central aspect…

11
arXiv — NLP / Computation & Language research 1d ago

Probing Stylistic Appropriation using Large Language Models: An Evaluation Framework for Copyright Infringement under EU Law

arXiv:2606.31250v1 Announce Type: new Abstract: Large language models (LLMs) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standard:…

36
arXiv — NLP / Computation & Language research 1d ago

When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue

arXiv:2606.31307v1 Announce Type: new Abstract: Large language models used in task-oriented dialogue often produce fluent but unsafe responses when backend database calls fail, return empty results, or surface mismatched information, inventing venues, confirmations, or booking…

16
arXiv — NLP / Computation & Language research 1d ago

LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment

arXiv:2606.31310v1 Announce Type: new Abstract: Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the…

9
arXiv — NLP / Computation & Language research 1d ago

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

arXiv:2606.31315v1 Announce Type: new Abstract: Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, which are then verified by the target model, enabling lossless acceleration. Recently, diffusion-based…

20
arXiv — NLP / Computation & Language research 1d ago

Linguistic Bias Mitigation for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

arXiv:2606.31411v1 Announce Type: new Abstract: Rapid advancements in generative speech technology have compromised the reliability of voice biometrics. While current spoofing detectors excel when assessed under in-domain conditions, generalisation to out-of-domain settings is…

4
arXiv — NLP / Computation & Language research 1d ago

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

arXiv:2606.31432v1 Announce Type: new Abstract: Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a…

33
arXiv — NLP / Computation & Language research 1d ago

Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

arXiv:2606.31446v1 Announce Type: new Abstract: RVL-CDIP is a popular dataset for benchmarking document classifiers. However, the dataset contains ample amounts of label errors as well as non-trivial amounts of test-train overlap, both of which may impact model performance…

25
arXiv — NLP / Computation & Language research 1d ago

Team MKC at CLPsych 2026: Capturing and Characterizing Mental Health Changes through Social Media Timeline Dynamics

arXiv:2606.31464v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) have motivated their adoption across a wide range of domains, including Artificial Intelligence (AI) for mental health. Given the growing prevalence of mental health disorders…

19
arXiv — NLP / Computation & Language research 1d ago

Building an ASR Solution for Training and Assessing Children's Reading

arXiv:2606.31508v1 Announce Type: new Abstract: Automatic speech recognition for children's reading remains underdeveloped for most African languages, including Bambara, despite its potential value for reproducible literacy assessment. We present an open-source system for…

30
arXiv — NLP / Computation & Language research 1d ago

FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

arXiv:2606.31522v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as "preserve capital" or "avoid speculative bets" that are meant to govern every decision…

19
arXiv — NLP / Computation & Language research 1d ago

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

arXiv:2606.31551v1 Announce Type: new Abstract: Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that…

37
arXiv — NLP / Computation & Language research 1d ago

Robust Text Watermarking for Large Language Models via Dual Semantic Embeddings

arXiv:2606.31602v1 Announce Type: new Abstract: This work presents Dual-Embedding Watermarking (DEW), a semantic watermarking scheme for large language models (LLMs) that leverages contextual and token-level embeddings to enhance robustness against paraphrasing and translation.…

8
arXiv — NLP / Computation & Language research 1d ago

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations…

37
arXiv — NLP / Computation & Language research 1d ago

Tone-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition

arXiv:2606.31642v1 Announce Type: new Abstract: Southern Bantu languages are spoken by over 80 million people, yet current foundation ASR models still produce zero-shot WER above 100%, which limits practical use in education and public services. We addressed this gap with a tone…

18
arXiv — NLP / Computation & Language research 1d ago

Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

arXiv:2606.31644v1 Announce Type: new Abstract: As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. We show that current fairness evaluations…

5
arXiv — NLP / Computation & Language research 1d ago

Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management

arXiv:2606.31692v1 Announce Type: new Abstract: This paper presents an overview of the second edition of the TalentCLEF challenge, organized as a Lab at the Conference and Labs of the Evaluation Forum (CLEF) 2026. TalentCLEF is an initiative aimed at advancing Natural Language…

19
arXiv — NLP / Computation & Language research 1d ago

Cross-lingual Relation Extraction with Large Language Models: Zero-Shot, Few-Shot, and Fine-Tuned Evaluation on Romanian

arXiv:2606.31718v1 Announce Type: new Abstract: Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large…

38
arXiv — NLP / Computation & Language research 1d ago

Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

arXiv:2606.31719v1 Announce Type: new Abstract: In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be…

22
arXiv — NLP / Computation & Language research 1d ago

Adapting Foundation ASR Models to Dysarthric Speech: A Case Study

arXiv:2606.31722v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems often perform poorly on dysarthric speech, limiting their usefulness to affected speakers in everyday communication. This paper presents a personalized ASR system for a dysarthric speaker,…

11
arXiv — NLP / Computation & Language research 1d ago

STEB: Style Text Embedding Benchmark

arXiv:2606.31741v1 Announce Type: new Abstract: While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap,…

27
arXiv — NLP / Computation & Language research 1d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

arXiv:2606.31796v1 Announce Type: new Abstract: We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output…

14
arXiv — NLP / Computation & Language research 1d ago

Explicit Fuzzy Logic in the Feed-Forward Layer: Self-Forgetting Quantifiers Discover Legible Grammatical-Licensing Detectors

arXiv:2606.31845v1 Announce Type: new Abstract: A transformer's feed-forward (FFN) sublayer materializes the distinctions attention gathers, yet gives no account of what it computes. In a parameter-neutral replacement, each hidden unit is an explicit fuzzy set operation on…

35
arXiv — NLP / Computation & Language research 1d ago

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

arXiv:2606.31916v1 Announce Type: new Abstract: Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper…

25
