Safety + alignment · 500 articles archived under #safety
arXiv — NLP / Computation & Language research 2h ago MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules arXiv:2607.00464v1 Announce Type: cross Abstract: Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many… 22 arXiv — Machine Learning research 2h ago PAPA: Online Personalized Active Preference Alignment arXiv:2607.00486v1 Announce Type: new Abstract: Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the… 11 arXiv — Machine Learning research 2h ago Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment arXiv:2607.00603v1 Announce Type: new Abstract: We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master… 34 arXiv — Machine Learning research 2h ago Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization arXiv:2607.00908v1 Announce Type: new Abstract: Mixed-precision quantization (MPQ) has become a key technique for deploying large language models under stringent memory and compute constraints.
We first identify a phenomenon that we term the Perplexity Illusion: layers ranked as… 7 arXiv — Machine Learning research 2h ago Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling arXiv:2607.01022v1 Announce Type: new Abstract: Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models,… 14 arXiv — Machine Learning research 2h ago Sequentially-Controlled Interactive Multi-Particle Flow-Maps for Online Feedback-Driven Search arXiv:2607.01144v1 Announce Type: new Abstract: While generative models have enabled training-free reward alignment, current methods typically excel in local exploration within narrow regions of the underlying distribution. These approaches struggle when preferences are unknown… 16 arXiv — NLP / Computation & Language research 2h ago A Mechanistic View of Authority Hierarchy in LLM Sycophancy arXiv:2607.00415v1 Announce Type: new Abstract: Authority bias poses a critical safety concern in language models: models systematically prioritize social cues from authority figures over factual consistency, swaying their answers based on source credibility rather than… 17 arXiv — NLP / Computation & Language research 2h ago Safe Alone, Unsafe Together: Safeguarding Against Implicit Toxicity When Benign Images Combine arXiv:2607.00576v1 Announce Type: new Abstract: Multi-image content has become an increasingly prevalent form of visual communication in social media, giving rise to a new safety issue, multi-image implicit toxicity (MIIT), where each image appears benign in isolation, but… 15 arXiv — NLP / Computation & Language research 2h ago MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark arXiv:2607.00724v1 Announce Type: new Abstract: Multilingual fluency often invites a stronger assumption: a model that can speak a user's language 
must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption… 8 arXiv — NLP / Computation & Language research 2h ago Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded… 14 r/MachineLearning community 5h ago Making Optimization Work When Labels Are Scarce [R] https://www.gnosyslabs.com/case-studies/safety-classifier-sparse-labels Gnosys is an autonomous model engineer: it improves prompts and classifiers when ground truth is too sparse for conventional optimization. On ToxicChat, a public safety benchmark, under realistic label… 23 Ars Technica — AI news-outlet 13h ago After spooking Trump into safety testing, Anthropic AI models get global release US lifts curbs on Anthropic’s advanced Fable and Mythos models. 31 llama.cpp releases dev-tools 15h ago b9857 hexagon: flash attention rework (optimizations, accuracy improvements, etc) ( #25085 ) hex-mm: fold mm quant tasks into the main matmul threads hex-mm: minor formatting fixes hex-mm: cleanup is_quant checks in dma dispatch hex-mm: fix dst-spad alignment hex-mm: move fp kernels… 5 r/MachineLearning community 20h ago A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P] Prompt injection has emerged as one of the most persistent failure modes in tool-using LLM systems, particularly in agentic workflows where models interact with external data sources. 
Most mitigation strategies focus on input filtering or model-side alignment, but these… 9 Hugging Face Daily Papers research 21h ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages Abstract Multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech-capable models are increasingly deployed in real-world applications across… 36 Vercel — AI dev-tools 23h ago Claude Fable 5 access restored on AI Gateway Access to Claude Fable 5, the Mythos-class model, has now been restored on AI Gateway following the US Government's decision to lift the export controls. Fable 5 is the same model that was available between June 9 and June 12. What has changed is the safety classifiers, which… 27 Hugging Face Daily Papers research 23h ago GEAR: Guided End-to-End AutoRegression for Image Synthesis Abstract GEAR trains a vector-quantized tokenizer and autoregressive generator jointly end-to-end using representation alignment, overcoming non-differentiability issues through a dual read-out approach that improves convergence speed and feature quality. Generated by… 36 arXiv — NLP / Computation & Language research 1d ago Revocable Learned State via Process Sidecars arXiv:2606.30788v1 Announce Type: cross Abstract: Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not… 17 arXiv — Machine Learning research 1d ago Safe Online Learning via Smooth Safety-Structured Policy Composition arXiv:2606.31320v1 Announce Type: new Abstract: Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. 
Existing approaches typically rely on either strict safety enforcement via action interventions,… 7 arXiv — Machine Learning research 1d ago Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images arXiv:2606.31394v1 Announce Type: new Abstract: Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, neural networks force distinct concepts into the lower… 14 arXiv — Machine Learning research 1d ago On the Convergence of Self-Improving Online LLM Alignment arXiv:2606.31524v1 Announce Type: new Abstract: The Self-Improving Alignment (SAIL) algorithm addresses distribution shift by reducing a bilevel formulation of the problem to an efficient, single-level method. Empirically, SAIL has demonstrated strong performance on this task.… 8 arXiv — Machine Learning research 1d ago Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment arXiv:2606.31591v1 Announce Type: new Abstract: Emergent misalignment (EM) is a recently discovered phenomenon in LLMs where fine-tuning on a narrow misaligned task, such as writing insecure code, leads to broadly misaligned behaviour on unrelated prompts. Previous work has… 11 arXiv — Machine Learning research 1d ago Addressing Over-Refusal in LLMs with Competing Rewards arXiv:2606.31748v1 Announce Type: new Abstract: Safety training on language models often induces over-refusal: improved safety on harmful prompts at the cost of increased refusal on harmless ones.
Though this trade-off can be mitigated by training models with reinforcement… 25 arXiv — NLP / Computation & Language research 1d ago Signed-Permutation Coordinate Transport for RMSNorm Transformers arXiv:2606.31963v1 Announce Type: cross Abstract: Modern LLM workflows move coordinate-indexed objects across checkpoints: steering vectors, sparse autoencoders, top-$k$ neuron sets, attribution lists, and merge alignments. This is only well posed after fixing the model's… 37 arXiv — NLP / Computation & Language research 1d ago From Propositional to Perceptual Asymmetry: Extending Frictive Policy Optimization to Asymmetric Partial Information Dialogue arXiv:2606.30973v1 Announce Type: new Abstract: Frictive Policy Optimization (FPO; Pustejovsky et al., 2025) treats friction in collaborative dialogue -- misalignment, misunderstanding, repair -- as an epistemic signal essential to common-ground construction, rather than noise… 18 arXiv — NLP / Computation & Language research 1d ago LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment arXiv:2606.31310v1 Announce Type: new Abstract: Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the… 9 arXiv — NLP / Computation & Language research 1d ago Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues arXiv:2606.31644v1 Announce Type: new Abstract: As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. 
We show that current fairness evaluations… 5 arXiv — NLP / Computation & Language research 1d ago ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive… 37 TechCrunch — AI news-outlet 1d ago Anthropic launches Claude Sonnet 5 as a cheaper way to run agents Anthropic’s Claude Sonnet 5 brings stronger agentic capabilities, lower pricing, and improved safety, positioning the model as a cheaper alternative to Opus, GPT-5.5, and Gemini Pro. 5 Hugging Face Daily Papers research 1d ago Mind the Heads: Topological Representation Alignment for Multimodal LLMs Abstract HeRA aligns individual attention heads in MLLMs to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation alignment has… 27 Hugging Face Daily Papers research 1d ago A Gravitational Interpretation of Fine-Tuning Reversion Abstract Post-alignment safety degradation arises from geometric properties of training history, where fine-tuning reversion follows a persistent direction defined by early training dynamics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Fine-tuning on harmless data can partially… 35 arXiv — Machine Learning research 2d ago A Gravitational Interpretation of Fine-Tuning Reversion arXiv:2606.28525v1 Announce Type: new Abstract: Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. 
Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently… 27 arXiv — Machine Learning research 2d ago MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment arXiv:2606.29049v1 Announce Type: new Abstract: Knowledge Tracing (KT) is important for personalized education but traditionally suffers from two key limitations: a reliance on shallow ID-based representations that neglect semantic depth and a restriction to single-granularity… 37 arXiv — Machine Learning research 2d ago Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using… 27 arXiv — Machine Learning research 2d ago Beyond Trajectory Matching: Reflow with Marginal Distribution Alignment arXiv:2606.29287v1 Announce Type: new Abstract: Diffusion and continuous-flow generative models achieve high-quality generation, and their deterministic sampling can be formulated as solving learned ODE dynamics. However, accurate ODE discretization often requires many steps,… 36 arXiv — Machine Learning research 2d ago Do Models Read What They Write? Causal Registers in Scratchpad Reasoning arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. 
For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a… 29 arXiv — Machine Learning research 2d ago VISTA-DZ: Visual Semantic Trajectory Adaptation for Personalized Dilemma Zone Prediction arXiv:2606.29548v1 Announce Type: new Abstract: Driver decision making in the dilemma zone at signalized intersections is safety critical, as vehicles approaching a yellow signal must decide whether to stop or proceed within limited time and distance margins. Accurate prediction… 38 arXiv — NLP / Computation & Language research 2d ago DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation arXiv:2606.28725v1 Announce Type: new Abstract: Automated toxicity moderation systems operate in dynamic online environments where harmful behavior evolves through coded language, shifting targets, and strategic adaptation to enforcement. Existing drift detection methods often… 12 arXiv — NLP / Computation & Language research 2d ago The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning arXiv:2606.28843v1 Announce Type: new Abstract: Fine-tuning a large language model is a ubiquitous method for enhancing its capability on a specific downstream task. However, prior work has shown that this increase in capability comes with a cost: it can increase a model's… 18 arXiv — NLP / Computation & Language research 2d ago A Hybrid Framework for Song Lyric Annotation Based on Human-LLM Alignment arXiv:2606.29273v1 Announce Type: new Abstract: Emotion recognition of song lyrics is a challenging task since lyrics may not necessarily align with the overall emotion of a song. As a result, lyrics annotation remains largely underexplored. 
Drawing inspiration from research in… 34 arXiv — NLP / Computation & Language research 2d ago Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages arXiv:2606.29649v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or… 31 arXiv — NLP / Computation & Language research 2d ago Timesteps of Mamba Align with Human Reading Times arXiv:2606.29904v1 Announce Type: new Abstract: This study demonstrates an alignment of per-word processing time in a popular state-space language model Mamba and human readers. In Mamba, the recurrent state transition at each layer conceptually takes some duration of time, the… 12 arXiv — NLP / Computation & Language research 2d ago Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization arXiv:2606.29933v1 Announce Type: new Abstract: The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and… 16 arXiv — NLP / Computation & Language research 2d ago Node-to-Neighborhood Semantic Consistency: Text-Topology Alignment for TAGs Anomaly Detection arXiv:2606.30009v1 Announce Type: new Abstract: Graph anomaly detection (GAD) on text-attributed graphs (TAGs) is vital for applications such as fraud detection and academic integrity verification. Existing approaches generally fall into two paradigms. 
GNN-based methods… 36 Hugging Face Daily Papers research 2d ago SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing Abstract SafePyramid benchmark evaluates guardrail systems' ability to identify safety violations through in-context policy specification across multiple domains and complexity levels. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world applications, guardrails are often… 5 arXiv — Machine Learning research 3d ago RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance arXiv:2606.27766v1 Announce Type: new Abstract: Offline reinforcement learning enables policy learning from fixed datasets without additional environment interaction, making it appealing for safety-critical applications where online exploration is costly or unsafe.… 32 arXiv — Machine Learning research 3d ago NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning arXiv:2606.27771v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of… 8 arXiv — Machine Learning research 3d ago OperatorSHAP: Fast and Accurate Shapley Value Estimation for Neural Operators arXiv:2606.28065v1 Announce Type: new Abstract: Understanding model predictions is essential for physical applications, where outputs often inform safety-critical decisions, such as structural load assessment, weather warnings, and clinical diagnosis. Shapley values satisfy many… 20 arXiv — Machine Learning research 3d ago Democratic ICAI: Debating Our Way to Steering Principles from Preferences arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. 
Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the… 38 arXiv — NLP / Computation & Language research 3d ago Position: The Term "Machine Unlearning" Is Overused in LLMs arXiv:2606.27379v1 Announce Type: new Abstract: Large language models increasingly face demands to "forget" training data, knowledge, or behaviors due to regulatory deletion obligations, copyright/licensing disputes, and safety or product-policy requirements. This position paper… 15