Safety + alignment
500 articles archived under #safety

arXiv — NLP / Computation & Language research 1mo ago Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography arXiv:2605.23035v1 Announce Type: new Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap… 36

arXiv — NLP / Computation & Language research 1mo ago Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs arXiv:2605.23157v1 Announce Type: new Abstract: The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study… 32

arXiv — NLP / Computation & Language research 1mo ago Naturalistic measure of social norms alignment arXiv:2605.23420v1 Announce Type: new Abstract: Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice… 18

arXiv — NLP / Computation & Language research 1mo ago Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation arXiv:2412.14642v4 Announce Type: replace Abstract: Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery.
However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one… 32

arXiv — NLP / Computation & Language research 1mo ago Training-Free Multimodal Large Language Model Orchestration arXiv:2508.10016v4 Announce Type: replace Abstract: Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free… 26

arXiv — NLP / Computation & Language research 1mo ago Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking arXiv:2602.17653v2 Announce Type: replace Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word… 21

Hugging Face Daily Papers research 1mo ago See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Abstract SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset. AI-generated summary We present SWIM… 35

Hugging Face Daily Papers research 1mo ago Geo-Align: Video Generation Alignment via Metric Geometry Reward Abstract Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction. AI-generated summary Camera-controlled video… 20

r/MachineLearning community 1mo ago Alignment: Higher order prioritizing over constraints [R] So, I ran across a behavior that I found interesting and may lead to alignment or safety research.
I'm going to try to maintain an abstract description of what happened without giving away the details and the keys to jailbreaking. The nature of a transformer is to predict the… 25

Hugging Face Daily Papers research 1mo ago AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment Abstract AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation while improving generation quality in downstream tasks.… 36

Ars Technica — AI news-outlet 1mo ago Trump canceled AI safety testing EO after snub from tech CEOs Trump delays AI safety testing EO, claiming it would be an innovation “blocker.” 35

arXiv — Machine Learning research 1mo ago HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine arXiv:2605.21496v1 Announce Type: new Abstract: Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level… 4

arXiv — Machine Learning research 1mo ago Harnesses for Inference-Time Alignment over Execution Trajectories arXiv:2605.21516v1 Announce Type: new Abstract: Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate… 20

arXiv — Machine Learning research 1mo ago Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift arXiv:2605.21552v1 Announce Type: new Abstract: Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention.
General confidence calibration methods assume training and test data are independent and… 26

arXiv — Machine Learning research 1mo ago From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment arXiv:2605.21558v1 Announce Type: new Abstract: Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated… 38

arXiv — Machine Learning research 1mo ago Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos arXiv:2605.21648v1 Announce Type: new Abstract: We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at… 37

arXiv — Machine Learning research 1mo ago Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization arXiv:2605.21801v1 Announce Type: new Abstract: Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish… 18

arXiv — Machine Learning research 1mo ago On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation arXiv:2605.21834v1 Announce Type: new Abstract: Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings.
Consistency training is a promising new alignment paradigm to mitigate such… 31

arXiv — NLP / Computation & Language research 1mo ago CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety arXiv:2605.21609v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in… 13

arXiv — NLP / Computation & Language research 1mo ago Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries arXiv:2605.21712v1 Announce Type: new Abstract: Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites… 10

arXiv — NLP / Computation & Language research 1mo ago Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety arXiv:2605.22643v1 Announce Type: new Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the… 16

arXiv — NLP / Computation & Language research 1mo ago Boundary-targeted Membership Inference Attacks on Safety Classifiers arXiv:2605.22373v1 Announce Type: cross Abstract: Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on… 6

TechCrunch — AI news-outlet 1mo ago The Path, founded by Tony Robbins and Calm alums, hopes to offer safer AI therapy The Path says its AI model has scored 95 on the mental health safety AI benchmark, Vera-MH. This compares to a top score of 65 for the consumer bots.
4

Hugging Face Daily Papers research 1mo ago CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing Abstract Current GUI agents show limited effectiveness in professional media post-production tasks despite advances in spatial grounding and multimodal alignment. AI-generated summary While GUI agents have made significant progress in web navigation and basic operating system… 13

Hugging Face Daily Papers research 1mo ago Stitched Value Model for Diffusion Alignment Abstract StitchVM efficiently transfers pretrained pixel-space reward models to noisy latent spaces for diffusion model alignment through a lightweight model stitching framework. AI-generated summary For practical use, diffusion- or flow-based generative models must be aligned… 4

Hugging Face Daily Papers research 1mo ago Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection Abstract Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection. AI-generated summary Safety post-training can… 32

arXiv — Machine Learning research 1mo ago Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry arXiv:2605.20241v1 Announce Type: new Abstract: Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular,… 8

arXiv — Machine Learning research 1mo ago Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs arXiv:2605.20270v1 Announce Type: new Abstract: A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$.
The operator needs a safety… 28

arXiv — Machine Learning research 1mo ago Spectral Souping: A Unified Framework for Online Preference Alignment arXiv:2605.20408v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this… 26

arXiv — Machine Learning research 1mo ago REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak arXiv:2605.20654v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal… 36

arXiv — Machine Learning research 1mo ago Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment arXiv:2605.20780v1 Announce Type: new Abstract: Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce… 8

arXiv — NLP / Computation & Language research 1mo ago Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning arXiv:2605.20730v1 Announce Type: new Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising… 22

arXiv — NLP / Computation & Language research 1mo ago Towards Context-Invariant Safety Alignment for Large Language Models arXiv:2605.20994v1 Announce Type: new Abstract: Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle.
A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording.… 34

arXiv — NLP / Computation & Language research 1mo ago Cross-lingual robustness of LLM-brain alignment and its computational roots arXiv:2605.21049v1 Announce Type: new Abstract: Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such… 35

arXiv — NLP / Computation & Language research 1mo ago LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models arXiv:2605.21362v1 Announce Type: new Abstract: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits… 21

Hugging Face Daily Papers research 1mo ago Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment Abstract Direct Preference Optimization (DPO) is theoretically equivalent to Reinforcement Learning from Human Feedback (RLHF) only under specific assumptions, otherwise optimizing different objectives; Constrained Preference Optimization (CPO) is proposed as a solution with… 17

Hugging Face Daily Papers research 1mo ago When Vision Speaks for Sound Abstract Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment. AI-generated summary Despite… 34

Hugging Face Daily Papers research 1mo ago Semantic Generative Tuning for Unified Multimodal Models Abstract Generative post-training with semantic segmentation as a proxy enhances multimodal alignment and performance in unified models.
AI-generated summary Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single… 20

arXiv — Machine Learning research 1mo ago Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training arXiv:2605.18822v1 Announce Type: new Abstract: Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable… 28

arXiv — Machine Learning research 1mo ago Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin arXiv:2605.18823v1 Announce Type: new Abstract: Digital twins (DTs) for urban transportation systems have gained increasing attention; however, their systematic evaluation in safety-critical scenarios remains limited. This paper presents a multi-pedestrian safety warning system… 24

arXiv — Machine Learning research 1mo ago Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling arXiv:2605.18838v1 Announce Type: new Abstract: Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a… 10

arXiv — Machine Learning research 1mo ago From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning arXiv:2605.18841v1 Announce Type: new Abstract: Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity.
In continual and… 6

arXiv — Machine Learning research 1mo ago Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints arXiv:2605.18842v1 Announce Type: new Abstract: Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental… 20

arXiv — Machine Learning research 1mo ago ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models arXiv:2605.18879v1 Announce Type: new Abstract: Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning… 32

arXiv — NLP / Computation & Language research 1mo ago LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models arXiv:2605.19416v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled… 5

arXiv — NLP / Computation & Language research 1mo ago GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter… 17

arXiv — NLP / Computation & Language research 1mo ago CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving arXiv:2605.19837v1 Announce Type: cross Abstract: Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles.
Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time… 32

Hugging Face Daily Papers research 1mo ago GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment Abstract GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary We present GoLongRL, a fully open-source,… 37

arXiv — Machine Learning research 1mo ago Goal-Conditioned Supervised Learning for LLM Fine-Tuning arXiv:2605.16345v1 Announce Type: new Abstract: Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment,… 28

arXiv — Machine Learning research 1mo ago Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need? arXiv:2605.16354v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety… 25

Page 10 of 10