Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 1mo ago

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

arXiv:2605.23035v1 Announce Type: new Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap…

36
arXiv — NLP / Computation & Language research 1mo ago

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

arXiv:2605.23157v1 Announce Type: new Abstract: The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study…

32
arXiv — NLP / Computation & Language research 1mo ago

Naturalistic measure of social norms alignment

arXiv:2605.23420v1 Announce Type: new Abstract: Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice…

18
arXiv — NLP / Computation & Language research 1mo ago

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

arXiv:2412.14642v4 Announce Type: replace Abstract: Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one…

32
arXiv — NLP / Computation & Language research 1mo ago

Training-Free Multimodal Large Language Model Orchestration

arXiv:2508.10016v4 Announce Type: replace Abstract: Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free…

26
arXiv — NLP / Computation & Language research 1mo ago

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

arXiv:2602.17653v2 Announce Type: replace Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word…

21
Hugging Face Daily Papers research 1mo ago

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Abstract SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset. AI-generated summary We present SWIM…

35
Hugging Face Daily Papers research 1mo ago

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Abstract Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction. AI-generated summary Camera-controlled video…

20
r/MachineLearning community 1mo ago

Alignment: Higher order prioritizing over constraints [R]

So, I ran across a behavior that I found interesting and may lead to alignment or safety research. I'm going to try to maintain an abstract description of what happened without giving away the details and the keys to jailbreaking. The nature of a transformer is to predict the…

25
Hugging Face Daily Papers research 1mo ago

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Abstract AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation while improving generation quality in downstream tasks.…

36
Ars Technica — AI news-outlet 1mo ago

Trump canceled AI safety testing EO after snub from tech CEOs

Trump delays AI safety testing EO, claiming it would be an innovation “blocker.”

35
arXiv — Machine Learning research 1mo ago

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

arXiv:2605.21496v1 Announce Type: new Abstract: Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level…

4
arXiv — Machine Learning research 1mo ago

Harnesses for Inference-Time Alignment over Execution Trajectories

arXiv:2605.21516v1 Announce Type: new Abstract: Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate…

20
arXiv — Machine Learning research 1mo ago

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

arXiv:2605.21552v1 Announce Type: new Abstract: Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and…

26
arXiv — Machine Learning research 1mo ago

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

arXiv:2605.21558v1 Announce Type: new Abstract: Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated…

38
arXiv — Machine Learning research 1mo ago

Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

arXiv:2605.21648v1 Announce Type: new Abstract: We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at…

37
arXiv — Machine Learning research 1mo ago

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

arXiv:2605.21801v1 Announce Type: new Abstract: Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish…

18
arXiv — Machine Learning research 1mo ago

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

arXiv:2605.21834v1 Announce Type: new Abstract: Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such…

31
arXiv — NLP / Computation & Language research 1mo ago

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

arXiv:2605.21609v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in…

13
arXiv — NLP / Computation & Language research 1mo ago

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

arXiv:2605.21712v1 Announce Type: new Abstract: Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites…

10
arXiv — NLP / Computation & Language research 1mo ago

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

arXiv:2605.22643v1 Announce Type: new Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the…

16
arXiv — NLP / Computation & Language research 1mo ago

Boundary-targeted Membership Inference Attacks on Safety Classifiers

arXiv:2605.22373v1 Announce Type: cross Abstract: Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on…

6
TechCrunch — AI news-outlet 1mo ago

The Path, founded by Tony Robbins and Calm alums, hopes to offer safer AI therapy

The Path says its AI model has scored 95 on the mental health safety AI benchmark, Vera-MH. This compares to a top score of 65 for the consumer bots.

4
Hugging Face Daily Papers research 1mo ago

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Abstract Current GUI agents show limited effectiveness in professional media post-production tasks despite advances in spatial grounding and multimodal alignment. AI-generated summary While GUI agents have made significant progress in web navigation and basic operating system…

13
Hugging Face Daily Papers research 1mo ago

Stitched Value Model for Diffusion Alignment

Abstract StitchVM efficiently transfers pretrained pixel-space reward models to noisy latent spaces for diffusion model alignment through a lightweight model stitching framework. AI-generated summary For practical use, diffusion- or flow-based generative models must be aligned…

4
Hugging Face Daily Papers research 1mo ago

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection. AI-generated summary Safety post-training can…

32
arXiv — Machine Learning research 1mo ago

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

arXiv:2605.20241v1 Announce Type: new Abstract: Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular,…

8
arXiv — Machine Learning research 1mo ago

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

arXiv:2605.20270v1 Announce Type: new Abstract: A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety…

28
arXiv — Machine Learning research 1mo ago

Spectral Souping: A Unified Framework for Online Preference Alignment

arXiv:2605.20408v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this…

26
arXiv — Machine Learning research 1mo ago

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

arXiv:2605.20654v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal…

36
arXiv — Machine Learning research 1mo ago

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

arXiv:2605.20780v1 Announce Type: new Abstract: Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce…

8
arXiv — NLP / Computation & Language research 1mo ago

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

arXiv:2605.20730v1 Announce Type: new Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising…

22
arXiv — NLP / Computation & Language research 1mo ago

Towards Context-Invariant Safety Alignment for Large Language Models

arXiv:2605.20994v1 Announce Type: new Abstract: Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording.…

34
arXiv — NLP / Computation & Language research 1mo ago

Cross-lingual robustness of LLM-brain alignment and its computational roots

arXiv:2605.21049v1 Announce Type: new Abstract: Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such…

35
arXiv — NLP / Computation & Language research 1mo ago

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

arXiv:2605.21362v1 Announce Type: new Abstract: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits…

21
Hugging Face Daily Papers research 1mo ago

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Abstract Direct Preference Optimization (DPO) is theoretically equivalent to Reinforcement Learning from Human Feedback (RLHF) only under specific assumptions, otherwise optimizing different objectives; Constrained Preference Optimization (CPO) is proposed as a solution with…

17
Hugging Face Daily Papers research 1mo ago

When Vision Speaks for Sound

Abstract Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment. AI-generated summary Despite…

34
Hugging Face Daily Papers research 1mo ago

Semantic Generative Tuning for Unified Multimodal Models

Abstract Generative post-training with semantic segmentation as a proxy enhances multimodal alignment and performance in unified models. AI-generated summary Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single…

20
arXiv — Machine Learning research 1mo ago

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

arXiv:2605.18822v1 Announce Type: new Abstract: Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable…

28
arXiv — Machine Learning research 1mo ago

Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin

arXiv:2605.18823v1 Announce Type: new Abstract: Digital twins (DTs) for urban transportation systems have gained increasing attention; however, their systematic evaluation in safety-critical scenarios remains limited. This paper presents a multi-pedestrian safety warning system…

24
arXiv — Machine Learning research 1mo ago

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

arXiv:2605.18838v1 Announce Type: new Abstract: Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a…

10
arXiv — Machine Learning research 1mo ago

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

arXiv:2605.18841v1 Announce Type: new Abstract: Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and…

6
arXiv — Machine Learning research 1mo ago

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

arXiv:2605.18842v1 Announce Type: new Abstract: Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental…

20
arXiv — Machine Learning research 1mo ago

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

arXiv:2605.18879v1 Announce Type: new Abstract: Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning…

32
arXiv — NLP / Computation & Language research 1mo ago

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

arXiv:2605.19416v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled…

5
arXiv — NLP / Computation & Language research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter…

17
arXiv — NLP / Computation & Language research 1mo ago

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

arXiv:2605.19837v1 Announce Type: cross Abstract: Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time…

32
Hugging Face Daily Papers research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Abstract GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary We present GoLongRL, a fully open-source,…

37
arXiv — Machine Learning research 1mo ago

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

arXiv:2605.16345v1 Announce Type: new Abstract: Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment,…

28
arXiv — Machine Learning research 1mo ago

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv:2605.16354v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety…

25
