Reasoning: 500 articles archived under #reasoning
arXiv — Machine Learning research 16d ago ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training arXiv:2606.15682v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats… 35 arXiv — NLP / Computation & Language research 16d ago CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence--rationale… 21 arXiv — NLP / Computation & Language research 16d ago Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning arXiv:2606.15007v1 Announce Type: new Abstract: We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context… 19 arXiv — NLP / Computation & Language research 16d ago Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models arXiv:2606.15070v1 Announce Type: new Abstract: By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes.
However, they often suffer from overthinking, resulting in redundant… 23 arXiv — NLP / Computation & Language research 16d ago Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale arXiv:2606.15079v1 Announce Type: new Abstract: Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6… 16 arXiv — NLP / Computation & Language research 16d ago AdaMame: A Training Recipe for Adaptive Multilingual Reasoning arXiv:2606.15080v1 Announce Type: new Abstract: While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language… 31 arXiv — NLP / Computation & Language research 16d ago Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes arXiv:2606.15307v1 Announce Type: new Abstract: Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced… 21 arXiv — NLP / Computation & Language research 16d ago Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering arXiv:2606.15419v1 Announce Type: new Abstract: Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). 
Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM… 22 arXiv — NLP / Computation & Language research 16d ago Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are… 21 arXiv — NLP / Computation & Language research 16d ago ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection arXiv:2606.15770v1 Announce Type: new Abstract: This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two… 31 arXiv — NLP / Computation & Language research 16d ago When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy arXiv:2606.15833v1 Announce Type: new Abstract: Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge… 10 arXiv — NLP / Computation & Language research 16d ago SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks arXiv:2606.15872v1 Announce Type: new Abstract: Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. 
A closer look at model behavior reveals substantial… 27 arXiv — NLP / Computation & Language research 16d ago Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision arXiv:2606.15877v1 Announce Type: new Abstract: Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects… 8 arXiv — NLP / Computation & Language research 16d ago Neuron Level Analysis of Large Language Model in Legal Domain Reasoning arXiv:2606.15884v1 Announce Type: new Abstract: We presented a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we… 4 arXiv — NLP / Computation & Language research 16d ago Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning arXiv:2606.15972v1 Announce Type: new Abstract: With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer… 30 arXiv — NLP / Computation & Language research 16d ago A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. 
Furthermore, existing benchmarks often omit frontier reasoning… 30 arXiv — NLP / Computation & Language research 16d ago From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations arXiv:2606.16047v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated… 7 arXiv — NLP / Computation & Language research 16d ago Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking… 10 arXiv — NLP / Computation & Language research 16d ago GRACE: Step-Level Benchmark for Faithful Reasoning over Context arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can… 15 arXiv — NLP / Computation & Language research 16d ago Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. 
However,… 36 arXiv — NLP / Computation & Language research 16d ago Creative Collision: Directorial Persona Steering and Competition in Large Language Models arXiv:2606.16240v1 Announce Type: new Abstract: Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a \emph{single} semantic direction into the residual stream. We study the richer… 38 arXiv — NLP / Computation & Language research 16d ago Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate arXiv:2606.16360v1 Announce Type: new Abstract: Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead.… 16 arXiv — NLP / Computation & Language research 16d ago A Mechanistic Understanding of Pronoun Fidelity in LLMs arXiv:2606.16407v1 Announce Type: new Abstract: Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in… 35 Hugging Face Daily Papers research 16d ago Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning Abstract Nemotron 3 Ultra is a large-scale language model featuring hybrid Mamba-Attention architecture with 550 billion parameters, achieving high inference throughput and extended context length through specialized training techniques. 
Generated by… 5 Hugging Face Daily Papers research 16d ago VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models Abstract VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This technical… 16 arXiv — NLP / Computation & Language research 17d ago SuperThoughts: Reasoning Tokens in Superposition arXiv:2606.13862v1 Announce Type: cross Abstract: Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token… 12 arXiv — Machine Learning research 17d ago EM-NeSy: Expectation Maximization for Neurosymbolic Learning arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating… 38 arXiv — Machine Learning research 17d ago When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing arXiv:2606.14668v1 Announce Type: new Abstract: Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a… 36 arXiv — NLP / Computation & Language research 17d ago Which Models Perform Better in Inheritance Reasoning? arXiv:2606.13751v1 Announce Type: new Abstract: This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. 
The task evaluates the ability of large language models to solve inheritance cases that require legal… 6 arXiv — NLP / Computation & Language research 17d ago QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning arXiv:2606.13756v1 Announce Type: new Abstract: This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to… 35 arXiv — NLP / Computation & Language research 17d ago Implicit Reasoning for Large Language Model-based Generative Recommendation arXiv:2606.14142v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key… 10 arXiv — NLP / Computation & Language research 17d ago AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often… 13 arXiv — NLP / Computation & Language research 17d ago CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. 
Existing methods primarily focus on improving… 34 arXiv — NLP / Computation & Language research 17d ago AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization arXiv:2606.14694v1 Announce Type: new Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and… 4 arXiv — NLP / Computation & Language research 17d ago Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs arXiv:2606.13815v1 Announce Type: cross Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the… 37 arXiv — NLP / Computation & Language research 17d ago GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software… 37 arXiv — NLP / Computation & Language research 17d ago ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. 
Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where… 4 arXiv — NLP / Computation & Language research 17d ago MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We… 23 arXiv — NLP / Computation & Language research 17d ago Reward-SQL: Boosting Text-to-SQL via Stepwise Execution-Aware Reasoning and Process-Supervised Rewards arXiv:2505.04671v3 Announce Type: replace Abstract: Recent advances in large language models (LLMs) trained with reinforcement learning (RL) have improved Text-to-SQL performance. However, RL-based approaches still struggle with complex queries due to two key limitations:… 18 arXiv — NLP / Computation & Language research 17d ago Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links arXiv:2509.24102v5 Announce Type: replace Abstract: While moral reasoning has emerged as a promising research direction for large language models (LLMs), achieving robust generalization remains a critical challenge. This challenge arises from the gap between what is said and… 27 arXiv — NLP / Computation & Language research 17d ago C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning arXiv:2603.05167v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. 
We introduce… 20 r/MachineLearning community 17d ago I built an open-source Knowledge Graph pipeline with hybrid retrieval to improve LLM multi-hop reasoning [P] Hey everyone, I built an open-source full-stack pipeline (Django + React) that constructs a Knowledge Graph from raw text, detects thematic communities, and uses hybrid search to solve the "lost in the middle" problem in standard vector retrieval. The Pipeline: Ingestion &… 8 r/LocalLLaMA community 19d ago [NEW FAMILY OF MODELS] Supra1.5 family just released! SupraLabs just released the Supra-1.5-exp line, Base, Instruct, and GGUF! (Reasoning soon) Hey r/LocalLLaMA ! We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it,… 20 r/LocalLLaMA community 19d ago GLM 5.2 is out - open weights to be released next week. How did it do on my one-shot Pac-Man test? Quick initial impressions: - at 70 tok/s slower than GLM 5.1 - seems to spend more time reasoning - better results with my Pac-Man test The one-shot result is almost functional; apart from the ghosts getting stuck immediately after leaving the ghosts house, I did not notice any… 14 r/MachineLearning community 19d ago Price is not cost: how we are using the wrong variable to measure the cost of LLMs [D] Upfront disclosure: this is my write-up (and I'll link it below), but laying out the argument here so you can strawman/steelman it without clicking anything. Assertion 1: per token price is the wrong metric for measuring the cost of work done by LLMs/reasoning models. Users get… 36 r/LocalLLaMA community 19d ago Fable 5 data, including CoT https://huggingface.co/datasets/Glint-Research/Fable-5-traces A simple dataset of all the Fable 5 data we could get our hands on before it was taken away (no clue if it's coming back). Expect some fine-tuned models trained on this soon. 
Big thanks to the TeichAI team (weird… 20 r/LocalLLaMA community 20d ago MiniMax Sparse Attention (MSA) Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax… 14 NVIDIA Developer Blog official-blog 20d ago Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and... 25 Hugging Face Daily Papers research 20d ago ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages Abstract ArogyaBodha dataset and ArogyaSutra framework enhance multilingual medical reasoning in low-resource settings through diverse data integration and actor-critic multi-agent reasoning. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models (MLLMs)… 30 r/LocalLLaMA community 20d ago Has anyone noticed that the behavior of the Kimi model has changed? I have been using Kimi K2.6 in Kimi Code for a while. Although it can complete most tasks, it often requires a long time to think and try. Today the model's CoT has become very short and concise, and it feels much improved on coding tasks compared to before I heard that GLM 5.2… 30