Tag

Reasoning

500 articles archived under #reasoning

arXiv — NLP / Computation & Language research 2h ago

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

arXiv:2607.00276v1 Announce Type: cross Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning…

10
arXiv — Machine Learning research 2h ago

Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

arXiv:2607.00531v1 Announce Type: new Abstract: Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based…

22
arXiv — NLP / Computation & Language research 2h ago

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

arXiv:2607.01181v1 Announce Type: cross Abstract: RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only…

25
arXiv — NLP / Computation & Language research 2h ago

DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning

arXiv:2607.00341v1 Announce Type: new Abstract: Large language models achieve strong performance on many reasoning tasks when allowed to externalize intermediate steps as Chain-of-Thought (CoT). However, many questions require the model to internalize the multi-step reasoning…

32
arXiv — NLP / Computation & Language research 2h ago

Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors

arXiv:2607.00447v1 Announce Type: new Abstract: Large language models often produce hallucinated answers that violate prompt-level constraints. A key diagnostic question is whether these failures reflect missing knowledge, or whether the model has the relevant information but…

10
arXiv — NLP / Computation & Language research 2h ago

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

arXiv:2607.00482v1 Announce Type: new Abstract: Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self contradiction that consume tokens without improving answers. We show that these behaviors are…

4
arXiv — NLP / Computation & Language research 2h ago

Efficient Multilingual Reasoning Transfer via Progressive Code-Switching

arXiv:2607.00485v1 Announce Type: new Abstract: Large reasoning models (LRMs) have achieved strong reasoning capabilities in English, yet their performance degrades significantly when required to reason in other languages. A natural solution is to transfer the model's English…

9
arXiv — NLP / Computation & Language research 2h ago

CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models

arXiv:2607.00862v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have achieved remarkable success on complex tasks by leveraging long chain-of-thought (CoT) trajectories, yet they frequently exhibit overthinking on simple queries, resulting in significant token…

8
arXiv — NLP / Computation & Language research 2h ago

Message Passing Enables Efficient Reasoning

arXiv:2607.01077v1 Announce Type: new Abstract: While inference-time scaling has improved the reasoning abilities of large language models (LLMs), the need to generate long chains-of-thought (CoTs) is a computational bottleneck. Thus, in contrast to sequential scaling methods…

37
arXiv — NLP / Computation & Language research 2h ago

StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

arXiv:2607.00465v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same…

8
arXiv — NLP / Computation & Language research 2h ago

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed.…

21
arXiv — NLP / Computation & Language research 2h ago

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

arXiv:2607.00924v1 Announce Type: cross Abstract: Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable…

36
arXiv — NLP / Computation & Language research 2h ago

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

arXiv:2607.01223v1 Announce Type: cross Abstract: When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the…

18
arXiv — NLP / Computation & Language research 2h ago

Reasoning Up the Instruction Ladder for Controllable Language Models

arXiv:2511.04694v5 Announce Type: replace Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources within a single prompt context. Enforcing an instruction…

17
arXiv — NLP / Computation & Language research 2h ago

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,…

22
r/LocalLLaMA community 12h ago

Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM ? Any tests or benchmarks ?

Wondering which one would be better at speed / coding / reasoning. Submitted by /u/soyalemujica

32
Hugging Face Daily Papers research 14h ago

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Abstract Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches. Generated by…

11
arXiv — Machine Learning research 1d ago

Predictable GRPO: A Closed-Form Model of Training Dynamics

arXiv:2606.30789v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a standard tool for improving the reasoning ability of large language models, yet its training dynamics are still described empirically: reward trajectories are fit with…

16
arXiv — Machine Learning research 1d ago

Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O'Bryan Mathematics Competition

arXiv:2606.31048v1 Announce Type: new Abstract: This paper investigates knowledge distillation from a large reasoning model (DeepSeek-R1) to a compact student model (Qwen2.5-7B). Using historical problems from the John O'Bryan Mathematics Competition at Northern Kentucky…

7
arXiv — Machine Learning research 1d ago

ISM: Self-Improving Strategy Memory for Continual Mathematical Reasoning

arXiv:2606.31191v1 Announce Type: new Abstract: We propose Intelligent Schema Memory (ISM), a self-evolving memory-augmented system that improves mathematical reasoning for a frozen LLM under continual learning with hard episodic resets. ISM maintains a compact, self-refined…

34
arXiv — NLP / Computation & Language research 1d ago

Fork-Think with Confidence

arXiv:2606.31484v1 Announce Type: cross Abstract: Parallel thinking has enjoyed great success for boosting LLM performance on reasoning tasks without the need for any re-training. However, existing methods follow a think-first-then-decide paradigm, i.e., they first sample…

38
arXiv — NLP / Computation & Language research 1d ago

Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

arXiv:2606.31779v1 Announce Type: cross Abstract: Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model's hidden states, replacing…

34
arXiv — NLP / Computation & Language research 1d ago

Test-Time Verification for Text-to-SQL via Outcome Reward Models

arXiv:2606.30851v1 Announce Type: new Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority…

15
arXiv — NLP / Computation & Language research 1d ago

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

arXiv:2606.30989v1 Announce Type: new Abstract: Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive…

5
arXiv — NLP / Computation & Language research 1d ago

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

arXiv:2606.31432v1 Announce Type: new Abstract: Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a…

33
arXiv — NLP / Computation & Language research 1d ago

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations…

37
arXiv — NLP / Computation & Language research 1d ago

When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

arXiv:2606.30852v1 Announce Type: cross Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with…

11
arXiv — NLP / Computation & Language research 1d ago

PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding

arXiv:2606.31148v1 Announce Type: cross Abstract: 3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions. Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high…

17
arXiv — NLP / Computation & Language research 1d ago

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite…

29
arXiv — NLP / Computation & Language research 1d ago

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

arXiv:2606.31543v1 Announce Type: cross Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I…

4
arXiv — NLP / Computation & Language research 1d ago

From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

arXiv:2506.17294v3 Announce Type: replace Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding research area, offering advantages such as scalable availability and personalized narration. However, existing…

17
arXiv — NLP / Computation & Language research 1d ago

The Bidirectional Process Reward Model

arXiv:2508.01682v3 Announce Type: replace Abstract: Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs).…

5
arXiv — NLP / Computation & Language research 1d ago

Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

arXiv:2512.21002v3 Announce Type: replace Abstract: Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt…

28
arXiv — NLP / Computation & Language research 1d ago

What If We Allocate Test-Time Compute Adaptively?

arXiv:2602.01070v5 Announce Type: replace Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning…

30
r/MachineLearning community 1d ago

Anyone looking into the new MARS2 Workshop/Competition @ ECCV 2026? I saw Tec-do posting it. [D]

I recently came across the announcement for the MARS2 Workshop (Multimodal Reasoning Competition) at ECCV 2026. From what I understand, it focuses on multimodal reasoning and test-time reasoning (“slow thinking”), especially applied to video and real-world scenarios like…

30
Hugging Face Daily Papers research 1d ago

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Abstract OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing computer-use benchmarks…

24
Hugging Face Daily Papers research 2d ago

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Abstract A new benchmark evaluates multimodal large language models' ability to reason over dynamic visual evidence through controlled temporal-logical operations rather than simple object recognition. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent interest in multimodal…

25
arXiv — Machine Learning research 2d ago

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these…

31
arXiv — Machine Learning research 2d ago

When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling

arXiv:2606.28661v1 Announce Type: new Abstract: People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a…

22
arXiv — Machine Learning research 2d ago

Invariant Reasoning Directions in Latent Trajectories of Language Models

arXiv:2606.29164v1 Announce Type: new Abstract: Latent reasoning models perform multi-step inference directly in hidden-state space, yet the structure of these latent reasoning trajectories remains poorly understood. We show that contrastive refinement signals between stronger…

25
arXiv — Machine Learning research 2d ago

Do Models Read What They Write? Causal Registers in Scratchpad Reasoning

arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a…

29
arXiv — NLP / Computation & Language research 2d ago

EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control

arXiv:2606.28938v1 Announce Type: new Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we…

26
arXiv — NLP / Computation & Language research 2d ago

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs

arXiv:2606.29067v1 Announce Type: new Abstract: We present ThinkProbe, a framework for structural analysis of LLM reasoning traces. ThinkProbe converts each trace into a Thought Graph (a directed graph with cycles, 8 node types, and 6 edge types) and derives a 19-metric…

32
arXiv — NLP / Computation & Language research 2d ago

Travel-Oriented Reasoning Large Language Model via Domain-Specific Knowledge Graphs

arXiv:2606.29254v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate broad reasoning abilities but struggle with accuracy and reliability in specialized domains such as travel, where reasoning depends on precise definitions, rules, and expert-defined…

12
arXiv — NLP / Computation & Language research 2d ago

MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling

arXiv:2606.29265v1 Announce Type: new Abstract: Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents,…

17
arXiv — NLP / Computation & Language research 2d ago

EntroRouter: Learning Efficient Model Routing via Entropy Regulation

arXiv:2606.29424v1 Announce Type: new Abstract: Model routing balances solution accuracy and computational cost by selecting among models of varying capabilities. While recent multi-round frameworks interleave reasoning and planning, we identify a structural failure mode termed…

28
arXiv — NLP / Computation & Language research 2d ago

To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

arXiv:2606.29481v1 Announce Type: new Abstract: While reinforcement learning (RL) significantly enhances LLM reasoning, its efficacy is severely undermined by Pre-RL data overlap, where RL datasets overlap with pretraining or SFT corpora, causing models to exploit shortcuts by…

11
arXiv — NLP / Computation & Language research 2d ago

The Verbose Context Problem in Medical Records

arXiv:2606.29503v1 Announce Type: new Abstract: The verbose context problem occurs when structured concepts have token-inefficient textual representations. This bottleneck is acute in population health: cohort-level analysis of longitudinal patient records requires reasoning…

30
arXiv — NLP / Computation & Language research 2d ago

Two-Stage Prompt Optimization for Few-Shot Relation Extraction: From Reasoning-Guided Search to Gradient-Guided Refinement

arXiv:2606.29639v1 Announce Type: new Abstract: Automatic prompt optimization is still underexplored for episodic few-shot relation extraction with smaller language models. We propose a two-stage framework that combines reasoning-based prompt optimization with gradient-based…

7
arXiv — NLP / Computation & Language research 2d ago

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to…

33
