Tag

Safety + alignment

500 articles archived under #safety

arXiv — Machine Learning research 1mo ago

Parallel Tempering Initial Sampling in Inference-Time Reward Alignment

arXiv:2605.30991v1 Announce Type: new Abstract: Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this…

14
arXiv — NLP / Computation & Language research 1mo ago

Configurable Reward Model for Balanced Safety Alignment

arXiv:2605.30487v1 Announce Type: new Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety…

11
arXiv — NLP / Computation & Language research 1mo ago

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

arXiv:2605.30675v1 Announce Type: new Abstract: Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the…

5
arXiv — NLP / Computation & Language research 1mo ago

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

arXiv:2605.30723v1 Announce Type: new Abstract: LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as…

15
arXiv — NLP / Computation & Language research 1mo ago

Pairwise Reference Alignment as a Model-Level Ordinal Observable

arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference…

18
arXiv — NLP / Computation & Language research 1mo ago

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

arXiv:2605.30888v1 Announce Type: new Abstract: Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the…

31
arXiv — NLP / Computation & Language research 1mo ago

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

arXiv:2605.31073v1 Announce Type: new Abstract: Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent…

16
arXiv — NLP / Computation & Language research 1mo ago

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT)…

20
arXiv — NLP / Computation & Language research 1mo ago

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues…

36
r/LocalLLaMA community 1mo ago

13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. coder3101's variant achieves 96% ASR with capability…

17
Hugging Face Daily Papers research 1mo ago

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Abstract Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling. AI-generated…

17
arXiv — Machine Learning research 1mo ago

Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents

arXiv:2605.28850v1 Announce Type: new Abstract: We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. Using TradeArena, an auditable trading-agent testbed with risk reports, execution simulation,…

26
arXiv — Machine Learning research 1mo ago

Representation Alignment Rests on Linear Structure

arXiv:2605.28870v1 Announce Type: new Abstract: We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal…

11
arXiv — Machine Learning research 1mo ago

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

arXiv:2605.28975v1 Announce Type: new Abstract: We study the log-alignment ratio (LAR), a measure of parameter-activation alignment, introduced in parameterization theory. We reformulate it as the overlap between a weight spectrum $p$ of the normalized squared singular values of…

37
arXiv — Machine Learning research 1mo ago

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

arXiv:2605.29028v1 Announce Type: new Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their…

17
arXiv — Machine Learning research 1mo ago

PROTOCOL: Late Interaction Retrieval for Protein Homolog Search

arXiv:2605.29158v1 Announce Type: new Abstract: Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose…

12
arXiv — Machine Learning research 1mo ago

SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction

arXiv:2605.29236v1 Announce Type: new Abstract: Alarm fatigue in intensive care units (ICUs) is a well documented patient safety crisis. Clinical monitors generate 350 or more alarms per patient per day, out of which 72-99% are clinically irrelevant. Staff desensitization to…

29
arXiv — Machine Learning research 1mo ago

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

arXiv:2605.29659v1 Announce Type: new Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail…

4
arXiv — NLP / Computation & Language research 1mo ago

From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale

arXiv:2605.28826v1 Announce Type: new Abstract: In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. Language models trained with contemporary pipelines exhibit severe reshaping…

35
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated…

19
arXiv — NLP / Computation & Language research 1mo ago

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how…

31
arXiv — NLP / Computation & Language research 1mo ago

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

arXiv:2605.29224v1 Announce Type: new Abstract: AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment…

15
arXiv — NLP / Computation & Language research 1mo ago

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods…

19
arXiv — NLP / Computation & Language research 1mo ago

Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset

arXiv:2605.29365v1 Announce Type: new Abstract: Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human…

20
arXiv — NLP / Computation & Language research 1mo ago

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

arXiv:2605.29414v1 Announce Type: new Abstract: Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However,…

32
arXiv — NLP / Computation & Language research 1mo ago

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

arXiv:2605.29458v1 Announce Type: new Abstract: Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and…

36
arXiv — NLP / Computation & Language research 1mo ago

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries,…

9
arXiv — NLP / Computation & Language research 1mo ago

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

arXiv:2605.29708v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled…

36
Hugging Face Daily Papers research 1mo ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Abstract Vision-language models suffer from modality sensitivity due to training data bias, but a new data curation approach called Local Modality Substitution improves cross-modal representation alignment and reasoning performance. AI-generated summary Vision-Language Models…

26
Hugging Face Daily Papers research 1mo ago

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract A lightweight and scalable agent safety alignment framework is proposed to address emerging threats from advanced AI models, featuring taxonomy-guided training with minimal samples and efficient deployment in real-world scenarios. AI-generated summary Modern open-world…

23
Hugging Face Daily Papers research 1mo ago

Native Audio-Visual Alignment for Generation

Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and…

38
The Information — AI news-outlet 1mo ago

Illinois Legislature Passes Landmark AI Safety Bill

On Wednesday, the Illinois House of Representatives passed a bill that will require major AI companies to submit their model safety plans for third-party audits, as well as creating whistleblower protections for those companies’ employees. While Governor JB Pritzker still has to…

9
Ars Technica — AI news-outlet 1mo ago

Trump loses more control over AI regulation as Illinois passes landmark law

Here’s why Anthropic and OpenAI are on board with Illinois safety testing.

9
arXiv — Machine Learning research 1mo ago

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

arXiv:2605.27659v1 Announce Type: new Abstract: Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world…

15
arXiv — Machine Learning research 1mo ago

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

arXiv:2605.27758v1 Announce Type: new Abstract: Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While…

18
arXiv — Machine Learning research 1mo ago

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized…

17
arXiv — Machine Learning research 1mo ago

FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

arXiv:2605.27892v1 Announce Type: new Abstract: Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are…

32
arXiv — Machine Learning research 1mo ago

AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels

arXiv:2605.28021v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on…

25
arXiv — Machine Learning research 1mo ago

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

arXiv:2605.28030v1 Announce Type: new Abstract: Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a…

22
arXiv — NLP / Computation & Language research 1mo ago

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

arXiv:2605.27374v1 Announce Type: new Abstract: Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical…

26
arXiv — NLP / Computation & Language research 1mo ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

arXiv:2605.27383v1 Announce Type: new Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by…

24
arXiv — NLP / Computation & Language research 1mo ago

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

arXiv:2605.27388v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical…

16
arXiv — NLP / Computation & Language research 1mo ago

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

arXiv:2605.27545v1 Announce Type: new Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple…

38
arXiv — NLP / Computation & Language research 1mo ago

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

arXiv:2605.27690v1 Announce Type: new Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore…

22
arXiv — NLP / Computation & Language research 1mo ago

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

arXiv:2605.27901v1 Announce Type: new Abstract: Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse…

30
arXiv — NLP / Computation & Language research 1mo ago

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations:…

18
arXiv — NLP / Computation & Language research 1mo ago

Chinese Word Boundary Recovery through Character Alignment Projection

arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper…

30
arXiv — NLP / Computation & Language research 1mo ago

Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

arXiv:2605.28188v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but…

23
arXiv — NLP / Computation & Language research 1mo ago

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

arXiv:2605.28292v1 Announce Type: new Abstract: Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example…

31
arXiv — NLP / Computation & Language research 1mo ago

HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

arXiv:2605.28308v1 Announce Type: new Abstract: Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject…

11
