Reasoning Up the Instruction Ladder for Controllable Language Models
Mirrored from arXiv — NLP / Computation & Language for archival readability. Support the source by reading on the original site.
Computer Science > Computation and Language
Title:Reasoning Up the Instruction Ladder for Controllable Language Models
Abstract:As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources within a single prompt context. Enforcing an instruction hierarchy, where higher-level directives override lower-priority requests, is critical to the reliability and control of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. The model must first "think" about the relationship between a given user prompt and higher-priority instructions before generating a response. To enable this capability, we construct VerIH, a training dataset of constraint-following tasks with verifiable answers, comprising aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our method leads to consistent improvements across multiple model families on both instruction following and instruction hierarchy benchmarks, achieving ~20% absolute improvement in conflict setups. Our method also leads to improved alignment to safety-critical scenarios beyond the training distribution, exhibiting increased robustness against jailbreak and prompt injection, reducing absolute attack success rates by up to 20%. Our results establish reasoning over instruction hierarchies as a practical mechanism for improving AI reliability, where targeted updates to system prompts produce predictable, controllable, and robust changes in model behavior.
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2511.04694 [cs.CL] |
| (or arXiv:2511.04694v5 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2511.04694
arXiv-issued DOI via DataCite
|
Submission history
From: Zishuo Zheng [view email][v1] Thu, 30 Oct 2025 22:13:31 UTC (256 KB)
[v2] Wed, 12 Nov 2025 01:19:01 UTC (256 KB)
[v3] Mon, 1 Dec 2025 21:07:46 UTC (260 KB)
[v4] Wed, 18 Feb 2026 02:51:47 UTC (260 KB)
[v5] Wed, 1 Jul 2026 16:09:28 UTC (306 KB)
Access Paper:
- View PDF
- HTML (experimental)
- TeX Source
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — NLP / Computation & Language
-
GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity
Jul 2
-
Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds
Jul 2
-
EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
Jul 2
-
Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.