Tag: Safety + alignment
500 articles archived under #safety

OpenAI official-blog 1mo ago OpenAI’s Frontier Governance Framework Explore OpenAI’s Frontier Governance Framework and how our AI safety, security, and risk practices align with emerging EU and California regulations. 15

Hugging Face Daily Papers research 1mo ago D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing Abstract Diffusion large language models generate text through multi-step denoising processes that expose intermediate representations useful for safety monitoring, leading to the development of a bi-level safety monitor that dynamically routes computational resources based on… 35

arXiv — Machine Learning research 1mo ago GEM: Geometric Entropy Mixing for Optimal LLM Data Curation arXiv:2605.26121v1 Announce Type: new Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering… 27

arXiv — Machine Learning research 1mo ago Curriculum Learning for Safety Alignment arXiv:2605.26315v1 Announce Type: new Abstract: Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate… 20

arXiv — Machine Learning research 1mo ago Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models arXiv:2605.26491v1 Announce Type: new Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary… 10

arXiv — Machine Learning research 1mo ago Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference arXiv:2605.26552v1 Announce Type: new Abstract: Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV,… 36

arXiv — Machine Learning research 1mo ago Linear and Neural Dueling Bandits with Delayed Feedback arXiv:2605.26554v1 Announce Type: new Abstract: Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption… 35

arXiv — NLP / Computation & Language research 1mo ago Cultural Value Alignment Via Latent Activation Steering in Large Language Models arXiv:2605.26365v1 Announce Type: new Abstract: Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access… 33

arXiv — NLP / Computation & Language research 1mo ago LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay… 34

arXiv — NLP / Computation & Language research 1mo ago Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines arXiv:2605.26442v1 Announce Type: new Abstract: Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment… 17

arXiv — NLP / Computation & Language research 1mo ago Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records arXiv:2605.26463v1 Announce Type: new Abstract: Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency… 7

arXiv — NLP / Computation & Language research 1mo ago EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation arXiv:2605.26785v1 Announce Type: new Abstract: Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability:… 21

arXiv — NLP / Computation & Language research 1mo ago Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs… 27

arXiv — NLP / Computation & Language research 1mo ago KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models arXiv:2605.26947v1 Announce Type: new Abstract: Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas… 5

arXiv — NLP / Computation & Language research 1mo ago AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian arXiv:2605.26954v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation… 25

arXiv — NLP / Computation & Language research 1mo ago Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments… 27

arXiv — NLP / Computation & Language research 1mo ago Grounding Text Embeddings in Stakeholder Associations arXiv:2605.27168v1 Announce Type: new Abstract: Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding… 17

Hugging Face Daily Papers research 1mo ago LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing… 32

Hugging Face Daily Papers research 1mo ago Cross-scale Aligned Supervision for Training GANs Abstract Standard GANs with adversarial supervision on intermediate outputs fail to maintain consistent sample trajectories across scales, leading to misalignment; a new transformer-based approach called CAT addresses this by enforcing consistency between intermediate and final… 28

Hugging Face Daily Papers research 1mo ago How Far Will They Go? Red-Teaming Online Influence with Large Language Models Abstract Open-source large language models exhibit varying political expressivity and vulnerability to jailbreak techniques, necessitating systematic red-teaming frameworks for assessing their potential misuse in influence campaigns. AI-generated summary As large language model… 25

Hugging Face Daily Papers research 1mo ago Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models Abstract Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation. AI-generated… 7

Hugging Face Daily Papers research 1mo ago Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference Abstract Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining. AI-generated summary Text-to-image diffusion models like Stable Diffusion generate high-quality images from… 35

Hugging Face Daily Papers research 1mo ago Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries Abstract A natural language interface for transportation safety analysis uses large language models to translate user queries into structured spatial operations while maintaining deterministic database execution for reliable and reproducible results. AI-generated summary… 21

r/LocalLLaMA community 1mo ago qwen 3.6 27B AR-> Diffusion - local training on 5090 based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable and a new psu on order. I did actually get… 22

Hugging Face Daily Papers research 1mo ago SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges Abstract SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages. AI-generated summary Sparse encoders offer high-precision retrieval by… 23

Hugging Face Daily Papers research 1mo ago Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution Abstract ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts. AI-generated summary Generative priors in Image Super-Resolution (SR) often compromise… 10

Hugging Face Daily Papers research 1mo ago Reinforcing Few-step Generators via Reward-Tilted Distribution Matching Abstract RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. AI-generated summary Recent advances in few-step diffusion distillation have… 30

arXiv — Machine Learning research 1mo ago AvAtar: Learning to Align via Active Optimal Transport arXiv:2605.24395v1 Announce Type: new Abstract: Alignment plays a fundamental role in many machine learning problems, such as multi-network analysis, multimodal learning, and point cloud registration. Recent works increasingly leverage optimal transport (OT) for distributional… 12

arXiv — Machine Learning research 1mo ago An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits arXiv:2605.24583v1 Announce Type: new Abstract: We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix… 8

arXiv — Machine Learning research 1mo ago On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks arXiv:2605.24649v1 Announce Type: new Abstract: Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy.… 15

arXiv — Machine Learning research 1mo ago The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench arXiv:2605.24782v1 Announce Type: new Abstract: While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based… 28

arXiv — NLP / Computation & Language research 1mo ago EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs arXiv:2605.23954v1 Announce Type: new Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement,… 36

arXiv — NLP / Computation & Language research 1mo ago AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue arXiv:2605.23974v1 Announce Type: new Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing… 36

arXiv — NLP / Computation & Language research 1mo ago Measuring the Depth of LLM Unlearning via Activation Patching arXiv:2605.24614v1 Announce Type: new Abstract: Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail… 17

arXiv — NLP / Computation & Language research 1mo ago Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA arXiv:2605.25204v1 Announce Type: new Abstract: Pluralistic alignment requires systems to adapt to diverse user values, communication styles, and contextual assumptions. We believe that a foundational prerequisite for such alignment enabling accurate preference elicitation from… 34

arXiv — NLP / Computation & Language research 1mo ago MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models arXiv:2605.25342v1 Announce Type: new Abstract: Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require… 29

arXiv — NLP / Computation & Language research 1mo ago LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers arXiv:2605.25415v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of… 17

arXiv — NLP / Computation & Language research 1mo ago SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models arXiv:2605.25420v1 Announce Type: new Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0,… 13

Hacker News — AI on Front Page community 1mo ago What we lost when we stopped letting kids leave the front yard Article URL: https://stevemagness.substack.com/p/the-cost-of-safetyism Comments URL: https://news.ycombinator.com/item?id=48267290 Points: 227 # Comments: 201 17

r/MachineLearning community 1mo ago Call for Papers - Workshop on Unlearning and Model Editing U&ME at ECCV 2026 [R] I have been seeing a lot of really interesting work lately around unlearning, model editing, controllability, safety, etc. Feels like this space is moving very fast right now, and there are still so many open questions. This year I’m helping organize the U&ME workshop at ECCV… 27

Hugging Face Daily Papers research 1mo ago LatentUMM: Dual Latent Alignment for Unified Multimodal Models Abstract LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding processes. AI-generated summary Unified… 30

arXiv — Machine Learning research 1mo ago Test-Time Training Undermines Safety Guardrails arXiv:2605.22984v1 Announce Type: new Abstract: Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning.… 24

arXiv — Machine Learning research 1mo ago Convex Optimization for Alignment and Preference Learning on a Single GPU arXiv:2605.23244v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain… 20

arXiv — Machine Learning research 1mo ago Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays arXiv:2605.23351v1 Announce Type: new Abstract: We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy.… 14

arXiv — Machine Learning research 1mo ago CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection arXiv:2605.23471v1 Announce Type: new Abstract: Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their… 28

arXiv — Machine Learning research 1mo ago Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models arXiv:2605.23522v1 Announce Type: new Abstract: Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the… 38

arXiv — NLP / Computation & Language research 1mo ago Evaluating Large Language Models in a Complex Hidden Role Game arXiv:2605.22826v1 Announce Type: new Abstract: Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of… 22

arXiv — NLP / Computation & Language research 1mo ago How Far Will They Go? Red-Teaming Online Influence with Large Language Models arXiv:2605.22880v1 Announce Type: new Abstract: As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus… 34

arXiv — NLP / Computation & Language research 1mo ago Graph Alignment Topology as an Inductive Bias for Grounding Detection arXiv:2605.22963v1 Announce Type: new Abstract: Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables… 12

arXiv — NLP / Computation & Language research 1mo ago Brain-LLM Alignment Tracks Training Data, Not Typology arXiv:2605.23032v1 Announce Type: new Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this… 20

Page 9 of 10 · 500 articles