Tag

Reasoning

500 articles archived under #reasoning

arXiv — Machine Learning research 16d ago

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

arXiv:2606.15682v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats…

35
arXiv — NLP / Computation & Language research 16d ago

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale…

21
arXiv — NLP / Computation & Language research 16d ago

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

arXiv:2606.15007v1 Announce Type: new Abstract: We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context…

19
arXiv — NLP / Computation & Language research 16d ago

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

arXiv:2606.15070v1 Announce Type: new Abstract: By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant…

23
arXiv — NLP / Computation & Language research 16d ago

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

arXiv:2606.15079v1 Announce Type: new Abstract: Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6…

16
arXiv — NLP / Computation & Language research 16d ago

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

arXiv:2606.15080v1 Announce Type: new Abstract: While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language…

31
arXiv — NLP / Computation & Language research 16d ago

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

arXiv:2606.15307v1 Announce Type: new Abstract: Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced…

21
arXiv — NLP / Computation & Language research 16d ago

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

arXiv:2606.15419v1 Announce Type: new Abstract: Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM…

22
arXiv — NLP / Computation & Language research 16d ago

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are…

21
arXiv — NLP / Computation & Language research 16d ago

ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

arXiv:2606.15770v1 Announce Type: new Abstract: This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two…

31
arXiv — NLP / Computation & Language research 16d ago

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

arXiv:2606.15833v1 Announce Type: new Abstract: Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge…

10
arXiv — NLP / Computation & Language research 16d ago

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

arXiv:2606.15872v1 Announce Type: new Abstract: Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial…

27
arXiv — NLP / Computation & Language research 16d ago

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

arXiv:2606.15877v1 Announce Type: new Abstract: Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects…

8
arXiv — NLP / Computation & Language research 16d ago

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

arXiv:2606.15884v1 Announce Type: new Abstract: We presented a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we…

4
arXiv — NLP / Computation & Language research 16d ago

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

arXiv:2606.15972v1 Announce Type: new Abstract: With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer…

30
arXiv — NLP / Computation & Language research 16d ago

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning…

30
arXiv — NLP / Computation & Language research 16d ago

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

arXiv:2606.16047v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated…

7
arXiv — NLP / Computation & Language research 16d ago

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking…

10
arXiv — NLP / Computation & Language research 16d ago

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can…

15
arXiv — NLP / Computation & Language research 16d ago

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However,…

36
arXiv — NLP / Computation & Language research 16d ago

Creative Collision: Directorial Persona Steering and Competition in Large Language Models

arXiv:2606.16240v1 Announce Type: new Abstract: Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer…

38
arXiv — NLP / Computation & Language research 16d ago

Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

arXiv:2606.16360v1 Announce Type: new Abstract: Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead.…

16
arXiv — NLP / Computation & Language research 16d ago

A Mechanistic Understanding of Pronoun Fidelity in LLMs

arXiv:2606.16407v1 Announce Type: new Abstract: Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in…

35
Hugging Face Daily Papers research 16d ago

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Abstract Nemotron 3 Ultra is a large-scale language model featuring hybrid Mamba-Attention architecture with 550 billion parameters, achieving high inference throughput and extended context length through specialized training techniques. Generated by…

5
Hugging Face Daily Papers research 16d ago

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This technical…

16
arXiv — NLP / Computation & Language research 17d ago

SuperThoughts: Reasoning Tokens in Superposition

arXiv:2606.13862v1 Announce Type: cross Abstract: Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token…

12
arXiv — Machine Learning research 17d ago

EM-NeSy: Expectation Maximization for Neurosymbolic Learning

arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating…

38
arXiv — Machine Learning research 17d ago

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

arXiv:2606.14668v1 Announce Type: new Abstract: Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a…

36
arXiv — NLP / Computation & Language research 17d ago

Which Models Perform Better in Inheritance Reasoning?

arXiv:2606.13751v1 Announce Type: new Abstract: This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal…

6
arXiv — NLP / Computation & Language research 17d ago

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

arXiv:2606.13756v1 Announce Type: new Abstract: This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to…

35
arXiv — NLP / Computation & Language research 17d ago

Implicit Reasoning for Large Language Model-based Generative Recommendation

arXiv:2606.14142v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key…

10
arXiv — NLP / Computation & Language research 17d ago

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often…

13
arXiv — NLP / Computation & Language research 17d ago

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving…

34
arXiv — NLP / Computation & Language research 17d ago

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

arXiv:2606.14694v1 Announce Type: new Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and…

4
arXiv — NLP / Computation & Language research 17d ago

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

arXiv:2606.13815v1 Announce Type: cross Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the…

37
arXiv — NLP / Computation & Language research 17d ago

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software…

37
arXiv — NLP / Computation & Language research 17d ago

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where…

4
arXiv — NLP / Computation & Language research 17d ago

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We…

23
arXiv — NLP / Computation & Language research 17d ago

Reward-SQL: Boosting Text-to-SQL via Stepwise Execution-Aware Reasoning and Process-Supervised Rewards

arXiv:2505.04671v3 Announce Type: replace Abstract: Recent advances in large language models (LLMs) trained with reinforcement learning (RL) have improved Text-to-SQL performance. However, RL-based approaches still struggle with complex queries due to two key limitations:…

18
arXiv — NLP / Computation & Language research 17d ago

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

arXiv:2509.24102v5 Announce Type: replace Abstract: While moral reasoning has emerged as a promising research direction for large language models (LLMs), achieving robust generalization remains a critical challenge. This challenge arises from the gap between what is said and…

27
arXiv — NLP / Computation & Language research 17d ago

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

arXiv:2603.05167v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce…

20
r/MachineLearning community 17d ago

I built an open-source Knowledge Graph pipeline with hybrid retrieval to improve LLM multi-hop reasoning [P]

Hey everyone, I built an open-source full-stack pipeline (Django + React) that constructs a Knowledge Graph from raw text, detects thematic communities, and uses hybrid search to solve the "lost in the middle" problem in standard vector retrieval. The Pipeline: Ingestion &…

8
r/LocalLLaMA community 19d ago

[NEW FAMILY OF MODELS] Supra1.5 family just released!

SupraLabs just released the Supra-1.5-exp line: Base, Instruct, and GGUF! (Reasoning soon.) Hey r/LocalLLaMA! We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it,…

20
r/LocalLLaMA community 19d ago

GLM 5.2 is out - open weights to be released next week. How did it do on my one-shot Pac-Man test?

Quick initial impressions: at 70 tok/s, slower than GLM 5.1; seems to spend more time reasoning; better results with my Pac-Man test. The one-shot result is almost functional; apart from the ghosts getting stuck immediately after leaving the ghost house, I did not notice any…

14
r/MachineLearning community 19d ago

Price is not cost: how we are using the wrong variable to measure the cost of LLMs [D]

Upfront disclosure: this is my write-up (and I'll link it below), but laying out the argument here so you can strawman/steelman it without clicking anything. Assertion 1: per token price is the wrong metric for measuring the cost of work done by LLMs/reasoning models. Users get…

36
r/LocalLLaMA community 19d ago

Fable 5 data, including CoT

https://huggingface.co/datasets/Glint-Research/Fable-5-traces A simple dataset of all the Fable 5 data we could get our hands on before it was taken away (no clue if it's coming back). Expect some fine-tuned models trained on this soon. Big thanks to the TeichAI team (weird…

20
r/LocalLLaMA community 20d ago

MiniMax Sparse Attention (MSA)

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax…

14
NVIDIA Developer Blog official-blog 20d ago

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

25
Hugging Face Daily Papers research 20d ago

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Abstract ArogyaBodha dataset and ArogyaSutra framework enhance multilingual medical reasoning in low-resource settings through diverse data integration and actor-critic multi-agent reasoning. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models (MLLMs)…

30
r/LocalLLaMA community 20d ago

Has anyone noticed that the behavior of the Kimi model has changed?

I have been using Kimi K2.6 in Kimi Code for a while. Although it can complete most tasks, it often requires a long time to think and try. Today the model's CoT has become very short and concise, and it feels much improved on coding tasks compared to before. I heard that GLM 5.2…

30
