Tag: Reasoning · 500 articles archived under #reasoning
arXiv — NLP / Computation & Language research 2d ago How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning arXiv:2606.29672v1 Announce Type: new Abstract: Evaluating the originality of visual images poses enduring challenges for creativity assessment. Automated scoring using AI models has proven effective in the verbal domain, yet key questions remain about evaluating visual… 11 arXiv — NLP / Computation & Language research 2d ago Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against… 8 arXiv — NLP / Computation & Language research 2d ago Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression arXiv:2606.29712v1 Announce Type: new Abstract: Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting… 22 arXiv — NLP / Computation & Language research 2d ago Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. 
We introduce clinical… 10 arXiv — NLP / Computation & Language research 2d ago LatentRevise: Learning from Zero-Hit Reasoning arXiv:2606.29938v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) is bottlenecked by hard prompts on which correct trajectories have low probability, so sampling misses them within a practical budget and leaves the policy update with little… 10 arXiv — NLP / Computation & Language research 2d ago Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning arXiv:2606.29985v1 Announce Type: new Abstract: Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing… 27 arXiv — NLP / Computation & Language research 2d ago Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs arXiv:2606.30093v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by grounding the generation process on external knowledge. However, standard RAG approaches struggle with multi-hop reasoning. 
While… 10 arXiv — NLP / Computation & Language research 2d ago DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning arXiv:2606.30189v1 Announce Type: new Abstract: Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world… 14 Hugging Face Daily Papers research 2d ago ReasoningLens: Hierarchical Visualization and Diagnostic Auditing for Large Reasoning Models Abstract ReasoningLens is an open-source framework that provides hierarchical visualization and diagnostic auditing for complex reasoning chains in large reasoning models, enabling structured analysis and error detection through interactive hierarchies and automated auditing.… 21 Hugging Face Daily Papers research 2d ago PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents Abstract POLICYGUARD is a sub-agent verifier that enhances LLM agent policy adherence by providing contextual reasoning and conversation-specific feedback across multi-turn interactions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents handle user requests on behalf of… 11 Hugging Face Daily Papers research 2d ago Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction Abstract Epi2Diff framework transforms LRM reasoning traces into cognitive episodes to predict human item difficulty more accurately than existing methods. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Predicting human item difficulty is central to educational assessment, where… 8 LangChain releases dev-tools 2d ago langchain-openrouter==0.2.5 Changes since langchain-openrouter==0.2.4 release(openrouter): 0.2.5 ( #38553 ) fix(openrouter): deduplicate repeated finish metadata ( #38552 ) fix(openrouter): strip Responses reasoning IDs ( #38383 ) 32 arXiv — Machine Learning research 3d ago The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment arXiv:2606.27739v1 Announce Type: new Abstract: Process reward models (PRMs) enhance the reasoning capabilities of large language models (LLMs) by providing fine-grained feedback, yet training PRMs typically requires expensive stepwise annotations. Outcome-supervised PRMs offer… 18 arXiv — Machine Learning research 3d ago COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives arXiv:2606.28194v1 Announce Type: new Abstract: While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on… 18 arXiv — Machine Learning research 3d ago Democratic ICAI: Debating Our Way to Steering Principles from Preferences arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the… 38 arXiv — NLP / Computation & Language research 3d ago When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. 
However, existing benchmarks often assume… 27 arXiv — NLP / Computation & Language research 3d ago ToxiREX: A Dataset on Toxic REasoning in ConteXt arXiv:2606.27981v1 Announce Type: new Abstract: We introduce a new, contextual, multilingual dataset called ToxiREX: Toxic REasoning in ConteXt. The dataset consists of threads of Reddit comments and structured characterizations of what the comments imply, following a systematic… 5 arXiv — NLP / Computation & Language research 3d ago Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction arXiv:2606.28186v1 Announce Type: new Abstract: Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual… 35 arXiv — NLP / Computation & Language research 3d ago EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from… 34 arXiv — NLP / Computation & Language research 3d ago SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. 
This broad deployment expands the safety surface: risks can arise from multimodal question answering,… 31 llama.cpp releases dev-tools 3d ago b9837 jinja, chat: add --reasoning-preserve flag ( #25105 ) jinja, chat: add --reasoning-preserve flag correct help message macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu… 28 llama.cpp releases dev-tools 3d ago b9835 ui: fix stop and reasoning skip in single-model mode ( #25084 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan)… 15 r/LocalLLaMA community 4d ago I built a tool to turn your Claude Code sessions into fine-tuning data for local models If you use Claude Code, every session is already sitting on disk as a .jsonl file under ~/.claude/projects/. It has real coding conversations: multi-turn edits, tool calls, reasoning traces. That's training data you already generated for free. The problem is the format is not… 36 r/MachineLearning community 4d ago MathFormer: Testing whether symbolic math is pattern matching or reasoning [D] Repo link and results - https://github.com/Abhinand20/MathFormer Task: Given a factorized expression like (7-3*z)*(-5*z-9), predict the expanded form -> 15*z**2-8*z-63 Key takeaway: A tiny (4M param) seq2seq model trained with no math knowledge reaches ~98.6% accuracy on… 7 Hugging Face Daily Papers research 5d ago Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation Abstract A unified agentic framework called Qwen-Image-Agent is proposed to address the context gap in text-to-image generation by progressively constructing complete generation context through planning, reasoning, searching, and memory mechanisms. 
Generated by… 22 Hugging Face Daily Papers research 5d ago Information-Aware KV Cache Compression for Long Reasoning Abstract InfoKV is an entropy-aware KV cache compression framework that enhances long-context reasoning in LLMs by incorporating information-theoretic signals alongside attention weights. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reasoning capability has advanced rapidly in… 10 Hugging Face Daily Papers research 5d ago Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments Abstract A web-based benchmark evaluates agent generalization across challenging scenarios, revealing significant gaps between current agentic systems and human performance in temporal perception, graphical understanding, and 3D reasoning. Generated by… 10 Hugging Face Daily Papers research 5d ago PhysiFormer: Learning to Simulate Mechanics in World Space Abstract PhysiFormer uses coordinate-space diffusion to generate physically-plausible 3D object motions without explicit inductive biases, enabling efficient multi-object reasoning and generalization to complex materials and geometries. Generated by… 30 r/LocalLLaMA community 6d ago Does llama cpp split mode tensor cause issues? I split qwen 27b and Gemma 4 26b (moe) across a 5080, and 2x 5060ti. I noticed setting split mode to tensor mode will cause looping issues in OpenCode with tool calls or just through the reasoning traces. Anyone else get this or understand why? 
Split mode layer seems to work… 25 Hugging Face Daily Papers research 6d ago How Post-Training Shapes Biological Reasoning Models Abstract Post-training stages in biological reasoning models differently affect generalization, with continued pre-training aligning models with biological language, supervised fine-tuning improving in-domain performance but reducing out-of-domain generalization, and… 8 arXiv — NLP / Computation & Language research 6d ago Epiphany-Aware KV Cache Eviction Without the Attention Matrix arXiv:2606.26472v1 Announce Type: cross Abstract: As reasoning models emit chains of thought tens of thousands of tokens long, KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance… 21 arXiv — Machine Learning research 6d ago Retrieval-Warmed Energy-Based Reasoning: A Five-Arm Ablation Methodology for Diffusion-as-Inference on Structured Reasoning Tasks arXiv:2606.26476v1 Announce Type: new Abstract: Warm-started diffusion samplers accelerate iterative inference, but it is rarely clear which part of the pipeline carries the gain. We study retrieval-warmed energy-based reasoning (RW-EBR), an IRED energy-based… 9 arXiv — Machine Learning research 6d ago What Survives When You Compress a Recursive Reasoner for the Edge? arXiv:2606.26488v1 Announce Type: new Abstract: Recursive reasoning models can solve complex structured tasks with only a few million parameters by repeatedly updating a latent state. Deploying these models on edge hardware requires significant compression, but unlike… 30 arXiv — Machine Learning research 6d ago Reasoning Quality Emerges Early: Data Curation for Reasoning Models arXiv:2606.26797v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). 
However, existing methods for curating… 14 arXiv — NLP / Computation & Language research 6d ago Context Recycling for Long-Horizon LLM Inference arXiv:2606.26105v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce… 27 arXiv — NLP / Computation & Language research 6d ago Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare arXiv:2606.26104v1 Announce Type: new Abstract: Animal-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare. Using vocabulary-matched stance-contrast probes on a held-out… 19 arXiv — NLP / Computation & Language research 6d ago Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning arXiv:2606.26108v1 Announce Type: new Abstract: Larger language models consistently outperform smaller ones on reasoning benchmarks, yet the reasoning differences underlying this gap remain underexplored. Across benchmarks in mathematics, physics, chemistry, and programming, we… 35 arXiv — NLP / Computation & Language research 6d ago From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's… 12 arXiv — NLP / Computation & Language research 6d ago Soft Token Alignment for Cross-Lingual Reasoning arXiv:2606.26466v1 Announce Type: new Abstract: Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. 
Prior work suggests that intermediate representations can be relatively… 5 arXiv — NLP / Computation & Language research 6d ago DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models arXiv:2606.26530v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC; Chollet, 2019) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches… 22 arXiv — NLP / Computation & Language research 6d ago Zero-shot Tweet-Level Stance Detection Enhanced by External Knowledge and Reflective Chain-of-Thought Reasoning arXiv:2606.26571v1 Announce Type: new Abstract: Zero-shot tweet-level stance detection confronts two primary challenges: (1) mitigating the context sparsity inherent in short texts, and (2) establishing the relevance between implicit targets and textual content. While existing… 35 arXiv — NLP / Computation & Language research 6d ago Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification arXiv:2606.26698v1 Announce Type: new Abstract: In today's fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, often fallacies appear in nuanced forms that complicate… 37 arXiv — NLP / Computation & Language research 6d ago Information-Aware KV Cache Compression for Long Reasoning arXiv:2606.26875v1 Announce Type: new Abstract: Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. 
Existing KV cache compression methods mainly rely on attention… 15 arXiv — NLP / Computation & Language research 6d ago ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models arXiv:2606.26986v1 Announce Type: new Abstract: Open Relation Extraction (OpenRE) requires a model to extract unseen relations between head and tail entities from unstructured text for real-world applications. The core challenge of OpenRE lies in achieving reliable… 13 arXiv — NLP / Computation & Language research 6d ago Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization arXiv:2606.27025v1 Announce Type: new Abstract: Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep,… 16 arXiv — NLP / Computation & Language research 6d ago The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans arXiv:2606.27103v1 Announce Type: new Abstract: Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks, however, it is unclear whether this accuracy is a result of pattern… 9 arXiv — NLP / Computation & Language research 6d ago Multilingual Reasoning Cascades Need More Context arXiv:2606.27306v1 Announce Type: new Abstract: Translation cascades for reasoning translate the query from another language to English, reason in English, and translate the answer back to the original language. This is a competitive approach to multilingual reasoning, but… 7 arXiv — NLP / Computation & Language research 6d ago The Verification Horizon: No Silver Bullet for Coding Agent Rewards arXiv:2606.26300v1 Announce Type: cross Abstract: A classical intuition holds that verifying a solution is easier than producing one. 
For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering… 24 arXiv — NLP / Computation & Language research 6d ago Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models arXiv:2606.26366v1 Announce Type: cross Abstract: Standard chain-of-thought on moral dilemmas exhibits two failure modes: stakeholder collapse (the trace names at most one party with a stake in the outcome) and uncertainty suppression (no explicit unknowns or hedges before… 29 arXiv — NLP / Computation & Language research 6d ago Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to… 19