Tag

Safety + alignment

500 articles archived under #safety

OpenAI official-blog 1mo ago

OpenAI’s Frontier Governance Framework

Explore OpenAI’s Frontier Governance Framework and how our AI safety, security, and risk practices align with emerging EU and California regulations.

15
Hugging Face Daily Papers research 1mo ago

D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Abstract Diffusion large language models generate text through multi-step denoising processes that expose intermediate representations useful for safety monitoring, leading to the development of a bi-level safety monitor that dynamically routes computational resources based on…

35
arXiv — Machine Learning research 1mo ago

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv:2605.26121v1 Announce Type: new Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering…

27
arXiv — Machine Learning research 1mo ago

Curriculum Learning for Safety Alignment

arXiv:2605.26315v1 Announce Type: new Abstract: Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate…

20
arXiv — Machine Learning research 1mo ago

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arXiv:2605.26491v1 Announce Type: new Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary…

10
arXiv — Machine Learning research 1mo ago

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

arXiv:2605.26552v1 Announce Type: new Abstract: Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV,…

36
arXiv — Machine Learning research 1mo ago

Linear and Neural Dueling Bandits with Delayed Feedback

arXiv:2605.26554v1 Announce Type: new Abstract: Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption…

35
arXiv — NLP / Computation & Language research 1mo ago

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

arXiv:2605.26365v1 Announce Type: new Abstract: Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access…

33
arXiv — NLP / Computation & Language research 1mo ago

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay…

34
arXiv — NLP / Computation & Language research 1mo ago

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

arXiv:2605.26442v1 Announce Type: new Abstract: Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment…

17
arXiv — NLP / Computation & Language research 1mo ago

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

arXiv:2605.26463v1 Announce Type: new Abstract: Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency…

7
arXiv — NLP / Computation & Language research 1mo ago

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

arXiv:2605.26785v1 Announce Type: new Abstract: Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability:…

21
arXiv — NLP / Computation & Language research 1mo ago

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs…

27
arXiv — NLP / Computation & Language research 1mo ago

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models

arXiv:2605.26947v1 Announce Type: new Abstract: Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas…

5
arXiv — NLP / Computation & Language research 1mo ago

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

arXiv:2605.26954v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation…

25
arXiv — NLP / Computation & Language research 1mo ago

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments…

27
arXiv — NLP / Computation & Language research 1mo ago

Grounding Text Embeddings in Stakeholder Associations

arXiv:2605.27168v1 Announce Type: new Abstract: Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding…

17
Hugging Face Daily Papers research 1mo ago

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing…

32
Hugging Face Daily Papers research 1mo ago

Cross-scale Aligned Supervision for Training GANs

Abstract Standard GANs with adversarial supervision on intermediate outputs fail to maintain consistent sample trajectories across scales, leading to misalignment; a new transformer-based approach called CAT addresses this by enforcing consistency between intermediate and final…

28
Hugging Face Daily Papers research 1mo ago

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Abstract Open-source large language models exhibit varying political expressivity and vulnerability to jailbreak techniques, necessitating systematic red-teaming frameworks for assessing their potential misuse in influence campaigns. AI-generated summary As large language model…

25
Hugging Face Daily Papers research 1mo ago

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation. AI-generated…

7
Hugging Face Daily Papers research 1mo ago

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Abstract Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining. AI-generated summary Text-to-image diffusion models like Stable Diffusion generate high-quality images from…

35
Hugging Face Daily Papers research 1mo ago

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

Abstract A natural language interface for transportation safety analysis uses large language models to translate user queries into structured spatial operations while maintaining deterministic database execution for reliable and reproducible results. AI-generated summary…

21
r/LocalLLaMA community 1mo ago

qwen 3.6 27B AR-> Diffusion - local training on 5090

Based on the work of open-dllm (which achieved a qwen 2.5 autoregressive -> diffusion realignment head, the same exact model under the hood, delivering a 4x improvement). TLDR: I haven't got a trained model yet, just a burnt-out GPU cable and a new PSU on order. I did actually get…

22
Hugging Face Daily Papers research 1mo ago

SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

Abstract SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages. AI-generated summary Sparse encoders offer high-precision retrieval by…

23
Hugging Face Daily Papers research 1mo ago

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Abstract ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts. AI-generated summary Generative priors in Image Super-Resolution (SR) often compromise…

10
Hugging Face Daily Papers research 1mo ago

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Abstract RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. AI-generated summary Recent advances in few-step diffusion distillation have…

30
arXiv — Machine Learning research 1mo ago

AvAtar: Learning to Align via Active Optimal Transport

arXiv:2605.24395v1 Announce Type: new Abstract: Alignment plays a fundamental role in many machine learning problems, such as multi-network analysis, multimodal learning, and point cloud registration. Recent works increasingly leverage optimal transport (OT) for distributional…

12
arXiv — Machine Learning research 1mo ago

An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits

arXiv:2605.24583v1 Announce Type: new Abstract: We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix…

8
arXiv — Machine Learning research 1mo ago

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

arXiv:2605.24649v1 Announce Type: new Abstract: Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy.…

15
arXiv — Machine Learning research 1mo ago

The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench

arXiv:2605.24782v1 Announce Type: new Abstract: While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based…

28
arXiv — NLP / Computation & Language research 1mo ago

EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

arXiv:2605.23954v1 Announce Type: new Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement,…

36
arXiv — NLP / Computation & Language research 1mo ago

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

arXiv:2605.23974v1 Announce Type: new Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing…

36
arXiv — NLP / Computation & Language research 1mo ago

Measuring the Depth of LLM Unlearning via Activation Patching

arXiv:2605.24614v1 Announce Type: new Abstract: Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail…

17
arXiv — NLP / Computation & Language research 1mo ago

Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA

arXiv:2605.25204v1 Announce Type: new Abstract: Pluralistic alignment requires systems to adapt to diverse user values, communication styles, and contextual assumptions. We believe that a foundational prerequisite for such alignment enabling accurate preference elicitation from…

34
arXiv — NLP / Computation & Language research 1mo ago

MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models

arXiv:2605.25342v1 Announce Type: new Abstract: Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require…

29
arXiv — NLP / Computation & Language research 1mo ago

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

arXiv:2605.25415v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of…

17
arXiv — NLP / Computation & Language research 1mo ago

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

arXiv:2605.25420v1 Announce Type: new Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0,…

13
Hacker News — AI on Front Page community 1mo ago

What we lost when we stopped letting kids leave the front yard

Article URL: https://stevemagness.substack.com/p/the-cost-of-safetyism Comments URL: https://news.ycombinator.com/item?id=48267290 Points: 227 # Comments: 201

17
r/MachineLearning community 1mo ago

Call for Papers - Workshop on Unlearning and Model Editing U&ME at ECCV 2026 [R]

I have been seeing a lot of really interesting work lately around unlearning, model editing, controllability, safety, etc. Feels like this space is moving very fast right now, and there are still so many open questions. This year I’m helping organize the U&ME workshop at ECCV…

27
Hugging Face Daily Papers research 1mo ago

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Abstract LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding processes. AI-generated summary Unified…

30
arXiv — Machine Learning research 1mo ago

Test-Time Training Undermines Safety Guardrails

arXiv:2605.22984v1 Announce Type: new Abstract: Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning.…

24
arXiv — Machine Learning research 1mo ago

Convex Optimization for Alignment and Preference Learning on a Single GPU

arXiv:2605.23244v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain…

20
arXiv — Machine Learning research 1mo ago

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

arXiv:2605.23351v1 Announce Type: new Abstract: We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy.…

14
arXiv — Machine Learning research 1mo ago

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

arXiv:2605.23471v1 Announce Type: new Abstract: Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their…

28
arXiv — Machine Learning research 1mo ago

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

arXiv:2605.23522v1 Announce Type: new Abstract: Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the…

38
arXiv — NLP / Computation & Language research 1mo ago

Evaluating Large Language Models in a Complex Hidden Role Game

arXiv:2605.22826v1 Announce Type: new Abstract: Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of…

22
arXiv — NLP / Computation & Language research 1mo ago

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

arXiv:2605.22880v1 Announce Type: new Abstract: As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus…

34
arXiv — NLP / Computation & Language research 1mo ago

Graph Alignment Topology as an Inductive Bias for Grounding Detection

arXiv:2605.22963v1 Announce Type: new Abstract: Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables…

12
arXiv — NLP / Computation & Language research 1mo ago

Brain-LLM Alignment Tracks Training Data, Not Typology

arXiv:2605.23032v1 Announce Type: new Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this…

20
