Tag: Reasoning · 500 articles archived under #reasoning

arXiv — NLP / Computation & Language research 2h ago Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds arXiv:2607.00276v1 Announce Type: cross Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning… 10

arXiv — Machine Learning research 2h ago Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization arXiv:2607.00531v1 Announce Type: new Abstract: Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based… 22

arXiv — NLP / Computation & Language research 2h ago Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations arXiv:2607.01181v1 Announce Type: cross Abstract: RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only… 25

arXiv — NLP / Computation & Language research 2h ago DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning arXiv:2607.00341v1 Announce Type: new Abstract: Large language models achieve strong performance on many reasoning tasks when allowed to externalize intermediate steps as Chain-of-Thought (CoT).
However, many questions require the model to internalize the multi-step reasoning… 32

arXiv — NLP / Computation & Language research 2h ago Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors arXiv:2607.00447v1 Announce Type: new Abstract: Large language models often produce hallucinated answers that violate prompt-level constraints. A key diagnostic question is whether these failures reflect missing knowledge, or whether the model has the relevant information but… 10

arXiv — NLP / Computation & Language research 2h ago Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking arXiv:2607.00482v1 Announce Type: new Abstract: Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self contradiction that consume tokens without improving answers. We show that these behaviors are… 4

arXiv — NLP / Computation & Language research 2h ago Efficient Multilingual Reasoning Transfer via Progressive Code-Switching arXiv:2607.00485v1 Announce Type: new Abstract: Large reasoning models (LRMs) have achieved strong reasoning capabilities in English, yet their performance degrades significantly when required to reason in other languages.
A natural solution is to transfer the model's English… 9

arXiv — NLP / Computation & Language research 2h ago CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models arXiv:2607.00862v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have achieved remarkable success on complex tasks by leveraging long chain-of-thought (CoT) trajectories, yet they frequently exhibit overthinking on simple queries, resulting in significant token… 8

arXiv — NLP / Computation & Language research 2h ago Message Passing Enables Efficient Reasoning arXiv:2607.01077v1 Announce Type: new Abstract: While inference-time scaling has improved the reasoning abilities of large language models (LLMs), the need to generate long chains-of-thought (CoTs) is a computational bottleneck. Thus, in contrast to sequential scaling methods… 37

arXiv — NLP / Computation & Language research 2h ago StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning arXiv:2607.00465v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same… 8

arXiv — NLP / Computation & Language research 2h ago MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input.
Existing what-if tasks typically vary the observer while keeping the scene fixed.… 21

arXiv — NLP / Computation & Language research 2h ago Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination arXiv:2607.00924v1 Announce Type: cross Abstract: Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable… 36

arXiv — NLP / Computation & Language research 2h ago Theoria: Rewrite-Acceptability Verification over Informal Reasoning States arXiv:2607.01223v1 Announce Type: cross Abstract: When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the… 18

arXiv — NLP / Computation & Language research 2h ago Reasoning Up the Instruction Ladder for Controllable Language Models arXiv:2511.04694v5 Announce Type: replace Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources within a single prompt context. Enforcing an instruction… 17

arXiv — NLP / Computation & Language research 2h ago Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents arXiv:2511.07397v3 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller,… 22

r/LocalLLaMA community 12h ago Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM? Any tests or benchmarks?
Wondering which one would be better at speed / coding / reasoning submitted by /u/soyalemujica 32

Hugging Face Daily Papers research 14h ago Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning Abstract Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches. Generated by… 11

arXiv — Machine Learning research 1d ago Predictable GRPO: A Closed-Form Model of Training Dynamics arXiv:2606.30789v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a standard tool for improving the reasoning ability of large language models, yet its training dynamics are still described empirically: reward trajectories are fit with… 16

arXiv — Machine Learning research 1d ago Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O Bryan Mathematics Competition arXiv:2606.31048v1 Announce Type: new Abstract: This paper investigates knowledge distillation from a large reasoning model (DeepSeek-R1) to a compact student model (Qwen2.5-7B). Using historical problems from the John O'Bryan Mathematics Competition at Northern Kentucky… 7

arXiv — Machine Learning research 1d ago ISM: Self-Improving Strategy Memory for Continual Mathematical Reasoning arXiv:2606.31191v1 Announce Type: new Abstract: We propose Intelligent Schema Memory (ISM), a self-evolving memory-augmented system that improves mathematical reasoning for a frozen LLM under continual learning with hard episodic resets.
ISM maintains a compact, self-refined… 34

arXiv — NLP / Computation & Language research 1d ago Fork-Think with Confidence arXiv:2606.31484v1 Announce Type: cross Abstract: Parallel thinking has enjoyed great success for boosting LLM performance on reasoning tasks without the need for any re-training. However, existing methods follow a think-first-then-decide paradigm, i.e., they first sample… 38

arXiv — NLP / Computation & Language research 1d ago Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers arXiv:2606.31779v1 Announce Type: cross Abstract: Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model's hidden states, replacing… 34

arXiv — NLP / Computation & Language research 1d ago Test-Time Verification for Text-to-SQL via Outcome Reward Models arXiv:2606.30851v1 Announce Type: new Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority… 15

arXiv — NLP / Computation & Language research 1d ago Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG arXiv:2606.30989v1 Announce Type: new Abstract: Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive… 5

arXiv — NLP / Computation & Language research 1d ago Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering arXiv:2606.31432v1 Announce Type: new Abstract: Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations.
A medication question, a diagnostic decision, a public-health item, and a… 33

arXiv — NLP / Computation & Language research 1d ago CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations… 37

arXiv — NLP / Computation & Language research 1d ago When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models arXiv:2606.30852v1 Announce Type: cross Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with… 11

arXiv — NLP / Computation & Language research 1d ago PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding arXiv:2606.31148v1 Announce Type: cross Abstract: 3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions. Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high… 17

arXiv — NLP / Computation & Language research 1d ago HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications.
We introduce HealthAgentBench, a suite… 29

arXiv — NLP / Computation & Language research 1d ago Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2 arXiv:2606.31543v1 Announce Type: cross Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I… 4

arXiv — NLP / Computation & Language research 1d ago From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary arXiv:2506.17294v3 Announce Type: replace Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding research area, offering advantages such as scalable availability and personalized narration. However, existing… 17

arXiv — NLP / Computation & Language research 1d ago The Bidirectional Process Reward Model arXiv:2508.01682v3 Announce Type: replace Abstract: Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs).… 5

arXiv — NLP / Computation & Language research 1d ago Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation arXiv:2512.21002v3 Announce Type: replace Abstract: Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt… 28

arXiv — NLP / Computation & Language research 1d ago What If We Allocate Test-Time Compute Adaptively? arXiv:2602.01070v5 Announce Type: replace Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking.
In contrast, we propose a verifier-guided adaptive framework treating reasoning… 30

r/MachineLearning community 1d ago Anyone looking into the new MARS2 Workshop/Competition @ ECCV 2026? I saw Tec-do posting it. [D] I recently came across the announcement for the MARS2 Workshop (Multimodal Reasoning Competition) at ECCV 2026. From what I understand, it focuses on multimodal reasoning and test-time reasoning (“slow thinking”), especially applied to video and real-world scenarios like… 30

Hugging Face Daily Papers research 1d ago OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks Abstract OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing computer-use benchmarks… 24

Hugging Face Daily Papers research 2d ago Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning Abstract A new benchmark evaluates multimodal large language models' ability to reason over dynamic visual evidence through controlled temporal-logical operations rather than simple object recognition. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent interest in multimodal… 25

arXiv — Machine Learning research 2d ago What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs.
Yet it remains unclear whether these… 31

arXiv — Machine Learning research 2d ago When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling arXiv:2606.28661v1 Announce Type: new Abstract: People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a… 22

arXiv — Machine Learning research 2d ago Invariant Reasoning Directions in Latent Trajectories of Language Models arXiv:2606.29164v1 Announce Type: new Abstract: Latent reasoning models perform multi-step inference directly in hidden-state space, yet the structure of these latent reasoning trajectories remains poorly understood. We show that contrastive refinement signals between stronger… 25

arXiv — Machine Learning research 2d ago Do Models Read What They Write? Causal Registers in Scratchpad Reasoning arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a… 29

arXiv — NLP / Computation & Language research 2d ago EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control arXiv:2606.28938v1 Announce Type: new Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we… 26

arXiv — NLP / Computation & Language research 2d ago ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs arXiv:2606.29067v1 Announce Type: new Abstract: We present ThinkProbe, a framework for structural analysis of LLM reasoning traces.
ThinkProbe converts each trace into a Thought Graph (a directed graph with cycles, 8 node types, and 6 edge types) and derives a 19-metric… 32

arXiv — NLP / Computation & Language research 2d ago Travel-Oriented Reasoning Large Language Model via Domain-Specific Knowledge Graphs arXiv:2606.29254v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate broad reasoning abilities but struggle with accuracy and reliability in specialized domains such as travel, where reasoning depends on precise definitions, rules, and expert-defined… 12

arXiv — NLP / Computation & Language research 2d ago MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling arXiv:2606.29265v1 Announce Type: new Abstract: Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents,… 17

arXiv — NLP / Computation & Language research 2d ago EntroRouter: Learning Efficient Model Routing via Entropy Regulation arXiv:2606.29424v1 Announce Type: new Abstract: Model routing balances solution accuracy and computational cost by selecting among models of varying capabilities.
While recent multi-round frameworks interleave reasoning and planning, we identify a structural failure mode termed… 28

arXiv — NLP / Computation & Language research 2d ago To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation arXiv:2606.29481v1 Announce Type: new Abstract: While reinforcement learning (RL) significantly enhances LLM reasoning, its efficacy is severely undermined by Pre-RL data overlap, where RL datasets overlap with pretraining or SFT corpora, causing models to exploit shortcuts by… 11

arXiv — NLP / Computation & Language research 2d ago The Verbose Context Problem in Medical Records arXiv:2606.29503v1 Announce Type: new Abstract: The verbose context problem occurs when structured concepts have token-inefficient textual representations. This bottleneck is acute in population health: cohort-level analysis of longitudinal patient records requires reasoning… 30

arXiv — NLP / Computation & Language research 2d ago Two-Stage Prompt Optimization for Few-Shot Relation Extraction: From Reasoning-Guided Search to Gradient-Guided Refinement arXiv:2606.29639v1 Announce Type: new Abstract: Automatic prompt optimization is still underexplored for episodic few-shot relation extraction with smaller language models. We propose a two-stage framework that combines reasoning-based prompt optimization with gradient-based… 7

arXiv — NLP / Computation & Language research 2d ago Hybrid Retriever Evolution for Multimodal Document Reasoning Agents arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to… 33