Tag

Reasoning

500 articles archived under #reasoning

arXiv — Machine Learning research 23d ago

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

arXiv:2606.08088v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of…

28
Hugging Face Daily Papers research 23d ago

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Spatial reasoning is a…

7
Hugging Face Daily Papers research 24d ago

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.…

22
Hugging Face Daily Papers research 24d ago

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Abstract Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric…

21
r/LocalLLaMA community 24d ago

Nex N2 has a funny "few words do trick" reasoning

I've been playing with Nex N2 Pro (Qwen 3.5 397B finetune) locally today. I noticed straight away that it has a pattern of reasoning that is distinct and uses simple words like "need" and "maybe" a lot. Here's a sample of reasoning. We need answer user asks "what is the theory…

16
Hugging Face Daily Papers research 24d ago

Reinforcement Learning from Rich Feedback with Distributional DAgger

Abstract Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reasoning models…

15
Hugging Face Daily Papers research 24d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
Hugging Face Daily Papers research 24d ago

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Abstract Post-hoc compression of reasoning traces reduces computational costs and inference lengths while maintaining high accuracy, offering an accuracy-efficiency trade-off in knowledge distillation. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reasoning models produce long…

24
Hugging Face Daily Papers research 24d ago

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Agentic search systems iteratively interact…

34
arXiv — Machine Learning research 24d ago

TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

arXiv:2606.06902v1 Announce Type: new Abstract: Targeted post-training aims to improve reasoning, math, and code without degrading strengths. Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or…

21
arXiv — Machine Learning research 24d ago

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B)…

17
arXiv — NLP / Computation & Language research 24d ago

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

arXiv:2606.07006v1 Announce Type: cross Abstract: Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However,…

15
arXiv — Machine Learning research 24d ago

On the Geometry of On-Policy Distillation

arXiv:2606.07082v1 Announce Type: new Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with…

10
arXiv — Machine Learning research 24d ago

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

arXiv:2606.07410v1 Announce Type: new Abstract: The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive…

18
arXiv — NLP / Computation & Language research 24d ago

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

arXiv:2606.06635v1 Announce Type: new Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two…

24
arXiv — NLP / Computation & Language research 24d ago

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

arXiv:2606.06646v1 Announce Type: new Abstract: Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text.…

10
arXiv — NLP / Computation & Language research 24d ago

Signal-Driven Observation for Long-Horizon Web Agents

arXiv:2606.06708v1 Announce Type: new Abstract: Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks…

7
arXiv — NLP / Computation & Language research 24d ago

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

arXiv:2606.06745v1 Announce Type: new Abstract: Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework…

25
arXiv — NLP / Computation & Language research 24d ago

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

arXiv:2606.06840v1 Announce Type: new Abstract: Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We…

30
arXiv — NLP / Computation & Language research 24d ago

CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

arXiv:2606.06842v1 Announce Type: new Abstract: Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning,…

34
arXiv — NLP / Computation & Language research 24d ago

Are Large Language Models Suitable for Graph Computation? Progress and Prospects

arXiv:2606.06865v1 Announce Type: new Abstract: Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such…

28
arXiv — NLP / Computation & Language research 24d ago

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

arXiv:2606.06915v1 Announce Type: new Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based…

34
arXiv — NLP / Computation & Language research 24d ago

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

arXiv:2606.07054v1 Announce Type: new Abstract: Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate…

22
arXiv — NLP / Computation & Language research 24d ago

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require…

19
arXiv — NLP / Computation & Language research 24d ago

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect…

21
arXiv — NLP / Computation & Language research 24d ago

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic…

19
arXiv — NLP / Computation & Language research 24d ago

How reliable are LLMs when it comes to playing dice?

arXiv:2606.07515v1 Announce Type: new Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a…

33
arXiv — NLP / Computation & Language research 24d ago

MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

arXiv:2606.06754v1 Announce Type: cross Abstract: We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable…

10
arXiv — NLP / Computation & Language research 24d ago

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations…

18
arXiv — NLP / Computation & Language research 24d ago

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

arXiv:2606.07512v1 Announce Type: cross Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple…

14
arXiv — NLP / Computation & Language research 24d ago

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

arXiv:2512.13278v2 Announce Type: replace Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which…

10
arXiv — NLP / Computation & Language research 24d ago

SEEK: Steering LLM Reasoning for RAG via Internal Reasoning Sketches

arXiv:2601.09402v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge into the generation process. Benefiting from the reasoning capabilities of LLMs, existing methods have leveraged…

8
arXiv — NLP / Computation & Language research 24d ago

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

arXiv:2602.11201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or…

22
Hugging Face Daily Papers research 24d ago

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Abstract Astra is an agentic spatial reasoning framework that enhances Vision-Language Models with action-conditioned visual imagination by coupling a reinforcement learning-trained policy with a world simulator for generating novel-view observations. Generated by…

22
Hugging Face Daily Papers research 24d ago

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Abstract Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning. Generated by…

8
Hugging Face Daily Papers research 24d ago

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) In real-world…

11
llama.cpp releases dev-tools 26d ago

b9544

common/chat : fix LFM2/LFM2.5 reasoning round-trip and leak (#24234). common/chat : fix LFM2 reasoning round-trip and stray leak. Gate by reasoning format and whether the template supports · macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled)…

30
r/LocalLLaMA community 26d ago

Z.ai, we need Air! GLM GGUF wen?

First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding. Now GLM 5.1 is a coding beast, but too huge for most to run locally, and even slow on API. Will we ever get another Air model with frontier reasoning and…

23
r/LocalLLaMA community 27d ago

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!

Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM…

12
Hugging Face Daily Papers research 27d ago

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Abstract Discrete-WAM introduces a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework. Generated by…

29
Hugging Face Daily Papers research 27d ago

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract World-language-action models combine textual instruction processing with robot state prediction through an autoregressive transformer backbone, enabling efficient long-horizon task execution and cross-embodiment learning. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We…

7
r/LocalLLaMA community 27d ago

[NEW MODEL] SupraLabs just released a new model! - Supra-50M-Reasoning

SupraLabs just released a new model! - Supra-50M-Reasoning Hello again r/LocalLLaMA ! Supra-50M-Reasoning (ThinkSupra-50M) is the reasoning version of Supra-50M-Instruct. It produces a full thinking chain before every answer, fine-tuned from Supra-50M-Base using a custom…

14
r/LocalLLaMA community 27d ago

Benchmark & Reality Check on Gemma 4 12B: Great model, but your local settings are probably breaking it (Fix inside)

I completed a Python bug hunting benchmark with Gemma 4 12B. I used the Unsloth Dynamic Q5 GGUF model. The model has good capabilities. Default settings in LM Studio disable the reasoning. Fix the LM Studio reasoning configuration. LM Studio looks for Qwen tokens. Gemma 4 uses…

30
Hugging Face Daily Papers research 27d ago

Multimodal Music Recommendation System using LLMs

Abstract A multimodal framework for session-based music recommendation integrates audio, lyric, and semantic signals with LLM-based sequential reasoning to improve recommendation accuracy. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Music recommendation systems typically treat…

16
arXiv — Machine Learning research 27d ago

State commitment learning: training language models to distinguish computation from memory

arXiv:2606.05201v1 Announce Type: new Abstract: Reasoning language models do not distinguish tokens used for computation from tokens that constitute persistent state: once generated, all hidden thoughts remain in context and influence future predictions. As a result, downstream…

19
arXiv — Machine Learning research 27d ago

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

arXiv:2606.05263v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing…

5
arXiv — Machine Learning research 27d ago

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

arXiv:2606.05434v1 Announce Type: new Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We…

17
arXiv — Machine Learning research 27d ago

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

arXiv:2606.05533v1 Announce Type: new Abstract: Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning…

13
arXiv — Machine Learning research 27d ago

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

arXiv:2606.05988v1 Announce Type: new Abstract: Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and…

30
arXiv — Machine Learning research 27d ago

HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care

arXiv:2606.05994v1 Announce Type: new Abstract: Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in healthcare domain. However, existing MKG-based approaches…

31
