Tag

Reasoning

500 articles archived under #reasoning

arXiv — NLP / Computation & Language research 2d ago

How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning

arXiv:2606.29672v1 Announce Type: new Abstract: Evaluating the originality of visual images poses enduring challenges for creativity assessment. Automated scoring using AI models has proven effective in the verbal domain, yet key questions remain about evaluating visual…

11
arXiv — NLP / Computation & Language research 2d ago

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against…

8
arXiv — NLP / Computation & Language research 2d ago

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

arXiv:2606.29712v1 Announce Type: new Abstract: Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting…

22
arXiv — NLP / Computation & Language research 2d ago

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical…

10
arXiv — NLP / Computation & Language research 2d ago

LatentRevise: Learning from Zero-Hit Reasoning

arXiv:2606.29938v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) is bottlenecked by hard prompts on which correct trajectories have low probability, so sampling misses them within a practical budget and leaves the policy update with little…

10
arXiv — NLP / Computation & Language research 2d ago

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

arXiv:2606.29985v1 Announce Type: new Abstract: Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing…

27
arXiv — NLP / Computation & Language research 2d ago

Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs

arXiv:2606.30093v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by grounding the generation process on external knowledge. However, standard RAG approaches struggle with multi-hop reasoning. While…

10
arXiv — NLP / Computation & Language research 2d ago

DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning

arXiv:2606.30189v1 Announce Type: new Abstract: Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world…

14
Hugging Face Daily Papers research 2d ago

ReasoningLens: Hierarchical Visualization and Diagnostic Auditing for Large Reasoning Models

Abstract: ReasoningLens is an open-source framework that provides hierarchical visualization and diagnostic auditing for complex reasoning chains in large reasoning models, enabling structured analysis and error detection through interactive hierarchies and automated auditing.…

21
Hugging Face Daily Papers research 2d ago

PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents

Abstract: POLICYGUARD is a sub-agent verifier that enhances LLM agent policy adherence by providing contextual reasoning and conversation-specific feedback across multi-turn interactions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) LLM agents handle user requests on behalf of…

11
Hugging Face Daily Papers research 2d ago

Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction

Abstract: The Epi2Diff framework transforms LRM reasoning traces into cognitive episodes to predict human item difficulty more accurately than existing methods. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Predicting human item difficulty is central to educational assessment, where…

8
LangChain releases dev-tools 2d ago

langchain-openrouter==0.2.5

Changes since langchain-openrouter==0.2.4: release(openrouter): 0.2.5 (#38553); fix(openrouter): deduplicate repeated finish metadata (#38552); fix(openrouter): strip Responses reasoning IDs (#38383)

32
arXiv — Machine Learning research 3d ago

The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment

arXiv:2606.27739v1 Announce Type: new Abstract: Process reward models (PRMs) enhance the reasoning capabilities of large language models (LLMs) by providing fine-grained feedback, yet training PRMs typically requires expensive stepwise annotations. Outcome-supervised PRMs offer…

18
arXiv — Machine Learning research 3d ago

COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives

arXiv:2606.28194v1 Announce Type: new Abstract: While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on…

18
arXiv — Machine Learning research 3d ago

Democratic ICAI: Debating Our Way to Steering Principles from Preferences

arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the…

38
arXiv — NLP / Computation & Language research 3d ago

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume…

27
arXiv — NLP / Computation & Language research 3d ago

ToxiREX: A Dataset on Toxic REasoning in ConteXt

arXiv:2606.27981v1 Announce Type: new Abstract: We introduce a new, contextual, multilingual dataset called ToxiREX: Toxic REasoning in ConteXt. The dataset consists of threads of Reddit comments and structured characterizations of what the comments imply, following a systematic…

5
arXiv — NLP / Computation & Language research 3d ago

Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction

arXiv:2606.28186v1 Announce Type: new Abstract: Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual…

35
arXiv — NLP / Computation & Language research 3d ago

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from…

34
arXiv — NLP / Computation & Language research 3d ago

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering,…

31
llama.cpp releases dev-tools 3d ago

b9837

jinja, chat: add --reasoning-preserve flag (#25105); jinja, chat: add --reasoning-preserve flag; correct help message. macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework. Linux: Ubuntu x64 (CPU), Ubuntu…

28
llama.cpp releases dev-tools 3d ago

b9835

ui: fix stop and reasoning skip in single-model mode (#25084). macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework. Linux: Ubuntu x64 (CPU), Ubuntu arm64 (CPU), Ubuntu s390x (CPU), Ubuntu x64 (Vulkan)…

15
r/LocalLLaMA community 4d ago

I built a tool to turn your Claude Code sessions into fine-tuning data for local models

If you use Claude Code, every session is already sitting on disk as a .jsonl file under ~/.claude/projects/. It has real coding conversations: multi-turn edits, tool calls, reasoning traces. That's training data you already generated for free. The problem is the format is not…

36
r/MachineLearning community 4d ago

MathFormer: Testing whether symbolic math is pattern matching or reasoning [D]

Repo link and results - https://github.com/Abhinand20/MathFormer Task: Given a factorized expression like (7-3*z)*(-5*z-9), predict the expanded form -> 15*z**2-8*z-63 Key takeaway: A tiny (4M param) seq2seq model trained with no math knowledge reaches ~98.6% accuracy on…

7
Hugging Face Daily Papers research 5d ago

Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

Abstract: A unified agentic framework called Qwen-Image-Agent is proposed to address the context gap in text-to-image generation by progressively constructing complete generation context through planning, reasoning, searching, and memory mechanisms. Generated by…

22
Hugging Face Daily Papers research 5d ago

Information-Aware KV Cache Compression for Long Reasoning

Abstract: InfoKV is an entropy-aware KV cache compression framework that enhances long-context reasoning in LLMs by incorporating information-theoretic signals alongside attention weights. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reasoning capability has advanced rapidly in…

10
Hugging Face Daily Papers research 5d ago

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Abstract: A web-based benchmark evaluates agent generalization across challenging scenarios, revealing significant gaps between current agentic systems and human performance in temporal perception, graphical understanding, and 3D reasoning. Generated by…

10
Hugging Face Daily Papers research 5d ago

PhysiFormer: Learning to Simulate Mechanics in World Space

Abstract: PhysiFormer uses coordinate-space diffusion to generate physically-plausible 3D object motions without explicit inductive biases, enabling efficient multi-object reasoning and generalization to complex materials and geometries. Generated by…

30
r/LocalLLaMA community 6d ago

Does llama cpp split mode tensor cause issues?

I split qwen 27b and Gemma 4 26b (moe) across a 5080, and 2x 5060ti. I noticed setting split mode to tensor mode will cause looping issues in OpenCode with tool calls or just through the reasoning traces. Anyone else get this or understand why? Split mode layer seems to work…

25
Hugging Face Daily Papers research 6d ago

How Post-Training Shapes Biological Reasoning Models

Abstract: Post-training stages in biological reasoning models differently affect generalization, with continued pre-training aligning models with biological language, supervised fine-tuning improving in-domain performance but reducing out-of-domain generalization, and…

8
arXiv — NLP / Computation & Language research 6d ago

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

arXiv:2606.26472v1 Announce Type: cross Abstract: As reasoning models emit chains of thought tens of thousands of tokens long, KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance…

21
arXiv — Machine Learning research 6d ago

Retrieval-Warmed Energy-Based Reasoning: A Five-Arm Ablation Methodology for Diffusion-as-Inference on Structured Reasoning Tasks

arXiv:2606.26476v1 Announce Type: new Abstract: Warm-started diffusion samplers accelerate iterative inference, but it is rarely clear which part of the pipeline carries the gain. We study retrieval-warmed energy-based reasoning (RW-EBR), an IRED energy-based…

9
arXiv — Machine Learning research 6d ago

What Survives When You Compress a Recursive Reasoner for the Edge?

arXiv:2606.26488v1 Announce Type: new Abstract: Recursive reasoning models can solve complex structured tasks with only a few million parameters by repeatedly updating a latent state. Deploying these models on edge hardware requires significant compression, but unlike…

30
arXiv — Machine Learning research 6d ago

Reasoning Quality Emerges Early: Data Curation for Reasoning Models

arXiv:2606.26797v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating…

14
arXiv — NLP / Computation & Language research 6d ago

Context Recycling for Long-Horizon LLM Inference

arXiv:2606.26105v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce…

27
arXiv — NLP / Computation & Language research 6d ago

Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare

arXiv:2606.26104v1 Announce Type: new Abstract: Animal-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare. Using vocabulary-matched stance-contrast probes on a held-out…

19
arXiv — NLP / Computation & Language research 6d ago

Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning

arXiv:2606.26108v1 Announce Type: new Abstract: Larger language models consistently outperform smaller ones on reasoning benchmarks, yet the reasoning differences underlying this gap remain underexplored. Across benchmarks in mathematics, physics, chemistry, and programming, we…

35
arXiv — NLP / Computation & Language research 6d ago

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's…

12
arXiv — NLP / Computation & Language research 6d ago

Soft Token Alignment for Cross-Lingual Reasoning

arXiv:2606.26466v1 Announce Type: new Abstract: Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively…

5
arXiv — NLP / Computation & Language research 6d ago

DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models

arXiv:2606.26530v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC; Chollet, 2019) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches…

22
arXiv — NLP / Computation & Language research 6d ago

Zero-shot Tweet-Level Stance Detection Enhanced by External Knowledge and Reflective Chain-of-Thought Reasoning

arXiv:2606.26571v1 Announce Type: new Abstract: Zero-shot tweet-level stance detection confronts two primary challenges: (1) mitigating the context sparsity inherent in short texts, and (2) establishing the relevance between implicit targets and textual content. While existing…

35
arXiv — NLP / Computation & Language research 6d ago

Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification

arXiv:2606.26698v1 Announce Type: new Abstract: In today's fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, often fallacies appear in nuanced forms that complicate…

37
arXiv — NLP / Computation & Language research 6d ago

Information-Aware KV Cache Compression for Long Reasoning

arXiv:2606.26875v1 Announce Type: new Abstract: Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention…

15
arXiv — NLP / Computation & Language research 6d ago

ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models

arXiv:2606.26986v1 Announce Type: new Abstract: Open Relation Extraction (OpenRE) requires a model to extract unseen relations between head and tail entities from unstructured text for real-world applications. The core challenge of OpenRE lies in achieving reliable…

13
arXiv — NLP / Computation & Language research 6d ago

Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

arXiv:2606.27025v1 Announce Type: new Abstract: Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep,…

16
arXiv — NLP / Computation & Language research 6d ago

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

arXiv:2606.27103v1 Announce Type: new Abstract: Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks, however, it is unclear whether this accuracy is a result of pattern…

9
arXiv — NLP / Computation & Language research 6d ago

Multilingual Reasoning Cascades Need More Context

arXiv:2606.27306v1 Announce Type: new Abstract: Translation cascades for reasoning translate the query from another language to English, reason in English, and translate the answer back to the original language. This is a competitive approach to multilingual reasoning, but…

7
arXiv — NLP / Computation & Language research 6d ago

The Verification Horizon: No Silver Bullet for Coding Agent Rewards

arXiv:2606.26300v1 Announce Type: cross Abstract: A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering…

24
arXiv — NLP / Computation & Language research 6d ago

Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models

arXiv:2606.26366v1 Announce Type: cross Abstract: Standard chain-of-thought on moral dilemmas exhibits two failure modes: stakeholder collapse (the trace names at most one party with a stake in the outcome) and uncertainty suppression (no explicit unknowns or hedges before…

29
arXiv — NLP / Computation & Language research 6d ago

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to…

19
