Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 2h ago

MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules

arXiv:2607.00464v1 Announce Type: cross Abstract: Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many…

22
arXiv — Machine Learning research 2h ago

PAPA: Online Personalized Active Preference Alignment

arXiv:2607.00486v1 Announce Type: new Abstract: Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the…

11
arXiv — Machine Learning research 2h ago

Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment

arXiv:2607.00603v1 Announce Type: new Abstract: We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master…

34
arXiv — Machine Learning research 2h ago

Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization

arXiv:2607.00908v1 Announce Type: new Abstract: Mixed-precision quantization (MPQ) has become a key technique for deploying large language models under stringent memory and compute constraints. We first identify a phenomenon that we term the Perplexity Illusion: layers ranked as…

7
arXiv — Machine Learning research 2h ago

Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling

arXiv:2607.01022v1 Announce Type: new Abstract: Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models,…

14
arXiv — Machine Learning research 2h ago

Sequentially-Controlled Interactive Multi-Particle Flow-Maps for Online Feedback-Driven Search

arXiv:2607.01144v1 Announce Type: new Abstract: While generative models have enabled training-free reward alignment, current methods typically excel in local exploration within narrow regions of the underlying distribution. These approaches struggle when preferences are unknown…

16
arXiv — NLP / Computation & Language research 2h ago

A Mechanistic View of Authority Hierarchy in LLM Sycophancy

arXiv:2607.00415v1 Announce Type: new Abstract: Authority bias poses a critical safety concern in language models: models systematically prioritize social cues from authority figures over factual consistency, swaying their answers based on source credibility rather than…

17
arXiv — NLP / Computation & Language research 2h ago

Safe Alone, Unsafe Together: Safeguarding Against Implicit Toxicity When Benign Images Combine

arXiv:2607.00576v1 Announce Type: new Abstract: Multi-image content has become an increasingly prevalent form of visual communication in social media, giving rise to a new safety issue, multi-image implicit toxicity (MIIT), where each image appears benign in isolation, but…

15
arXiv — NLP / Computation & Language research 2h ago

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

arXiv:2607.00724v1 Announce Type: new Abstract: Multilingual fluency often invites a stronger assumption: a model that can speak a user's language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption…

8
arXiv — NLP / Computation & Language research 2h ago

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv:2607.01153v1 Announce Type: new Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded…

14
r/MachineLearning community 5h ago

Making Optimization Work When Labels Are Scarce [R]

https://www.gnosyslabs.com/case-studies/safety-classifier-sparse-labels Gnosys is an autonomous model engineer: it improves prompts and classifiers when ground truth is too sparse for conventional optimization. On ToxicChat, a public safety benchmark, under realistic label…

23
Ars Technica — AI news-outlet 13h ago

After spooking Trump into safety testing, Anthropic AI models get global release

US lifts curbs on Anthropic’s advanced Fable and Mythos models.

31
llama.cpp releases dev-tools 15h ago

b9857

hexagon: flash attention rework (optimizations, accuracy improvements, etc) (#25085); hex-mm: fold mm quant tasks into the main matmul threads; hex-mm: minor formatting fixes; hex-mm: cleanup is_quant checks in dma dispatch; hex-mm: fix dst-spad alignment; hex-mm: move fp kernels…

5
r/MachineLearning community 20h ago

A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P]

Prompt injection has emerged as one of the most persistent failure modes in tool-using LLM systems, particularly in agentic workflows where models interact with external data sources. Most mitigation strategies focus on input filtering or model-side alignment, but these…

9
Hugging Face Daily Papers research 21h ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Abstract: Multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Speech-capable models are increasingly deployed in real-world applications across…

36
Vercel — AI dev-tools 23h ago

Claude Fable 5 access restored on AI Gateway

Access to Claude Fable 5, the Mythos-class model, has now been restored on AI Gateway following the US Government's decision to lift the export controls. Fable 5 is the same model that was available between June 9 and June 12. What has changed is the safety classifiers, which…

27
Hugging Face Daily Papers research 23h ago

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Abstract: GEAR trains a vector-quantized tokenizer and autoregressive generator jointly end-to-end using representation alignment, overcoming non-differentiability issues through a dual read-out approach that improves convergence speed and feature quality. Generated by…

36
arXiv — NLP / Computation & Language research 1d ago

Revocable Learned State via Process Sidecars

arXiv:2606.30788v1 Announce Type: cross Abstract: Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not…

17
arXiv — Machine Learning research 1d ago

Safe Online Learning via Smooth Safety-Structured Policy Composition

arXiv:2606.31320v1 Announce Type: new Abstract: Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions,…

7
arXiv — Machine Learning research 1d ago

Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images

arXiv:2606.31394v1 Announce Type: new Abstract: Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, neural networks force distinct concepts into the lower…

14
arXiv — Machine Learning research 1d ago

On the Convergence of Self-Improving Online LLM Alignment

arXiv:2606.31524v1 Announce Type: new Abstract: The Self-Improving Alignment (SAIL) algorithm addresses distribution shift by reducing a bilevel formulation of the problem to an efficient, single-level method. Empirically, SAIL has demonstrated strong performance on this task.…

8
arXiv — Machine Learning research 1d ago

Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment

arXiv:2606.31591v1 Announce Type: new Abstract: Emergent misalignment (EM) is a recently discovered phenomenon in LLMs where fine-tuning on a narrow misaligned task, such as writing insecure code, leads to broadly misaligned behaviour on unrelated prompts. Previous work has…

11
arXiv — Machine Learning research 1d ago

Addressing Over-Refusal in LLMs with Competing Rewards

arXiv:2606.31748v1 Announce Type: new Abstract: Safety training on language models often induces over-refusal: improved safety on harmful prompts at the cost of increased refusal on harmless ones. Though this trade-off can be mitigated by training models with reinforcement…

25
arXiv — NLP / Computation & Language research 1d ago

Signed-Permutation Coordinate Transport for RMSNorm Transformers

arXiv:2606.31963v1 Announce Type: cross Abstract: Modern LLM workflows move coordinate-indexed objects across checkpoints: steering vectors, sparse autoencoders, top-$k$ neuron sets, attribution lists, and merge alignments. This is only well posed after fixing the model's…

37
arXiv — NLP / Computation & Language research 1d ago

From Propositional to Perceptual Asymmetry: Extending Frictive Policy Optimization to Asymmetric Partial Information Dialogue

arXiv:2606.30973v1 Announce Type: new Abstract: Frictive Policy Optimization (FPO; Pustejovsky et al., 2025) treats friction in collaborative dialogue -- misalignment, misunderstanding, repair -- as an epistemic signal essential to common-ground construction, rather than noise…

18
arXiv — NLP / Computation & Language research 1d ago

LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment

arXiv:2606.31310v1 Announce Type: new Abstract: Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the…

9
arXiv — NLP / Computation & Language research 1d ago

Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

arXiv:2606.31644v1 Announce Type: new Abstract: As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. We show that current fairness evaluations…

5
arXiv — NLP / Computation & Language research 1d ago

ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive…

37
TechCrunch — AI news-outlet 1d ago

Anthropic launches Claude Sonnet 5 as a cheaper way to run agents

Anthropic’s Claude Sonnet 5 brings stronger agentic capabilities, lower pricing, and improved safety, positioning the model as a cheaper alternative to Opus, GPT-5.5, and Gemini Pro.

5
Hugging Face Daily Papers research 1d ago

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Abstract: HeRA aligns individual attention heads in MLLMs to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Representation alignment has…

27
Hugging Face Daily Papers research 1d ago

A Gravitational Interpretation of Fine-Tuning Reversion

Abstract: Post-alignment safety degradation arises from geometric properties of training history, where fine-tuning reversion follows a persistent direction defined by early training dynamics. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Fine-tuning on harmless data can partially…

35
arXiv — Machine Learning research 2d ago

A Gravitational Interpretation of Fine-Tuning Reversion

arXiv:2606.28525v1 Announce Type: new Abstract: Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently…

27
arXiv — Machine Learning research 2d ago

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

arXiv:2606.29049v1 Announce Type: new Abstract: Knowledge Tracing (KT) is important for personalized education but traditionally suffers from two key limitations: a reliance on shallow ID-based representations that neglect semantic depth and a restriction to single-granularity…

37
arXiv — Machine Learning research 2d ago

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using…

27
arXiv — Machine Learning research 2d ago

Beyond Trajectory Matching: Reflow with Marginal Distribution Alignment

arXiv:2606.29287v1 Announce Type: new Abstract: Diffusion and continuous-flow generative models achieve high-quality generation, and their deterministic sampling can be formulated as solving learned ODE dynamics. However, accurate ODE discretization often requires many steps,…

36
arXiv — Machine Learning research 2d ago

Do Models Read What They Write? Causal Registers in Scratchpad Reasoning

arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a…

29
arXiv — Machine Learning research 2d ago

VISTA-DZ: Visual Semantic Trajectory Adaptation for Personalized Dilemma Zone Prediction

arXiv:2606.29548v1 Announce Type: new Abstract: Driver decision making in the dilemma zone at signalized intersections is safety critical, as vehicles approaching a yellow signal must decide whether to stop or proceed within limited time and distance margins. Accurate prediction…

38
arXiv — NLP / Computation & Language research 2d ago

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

arXiv:2606.28725v1 Announce Type: new Abstract: Automated toxicity moderation systems operate in dynamic online environments where harmful behavior evolves through coded language, shifting targets, and strategic adaptation to enforcement. Existing drift detection methods often…

12
arXiv — NLP / Computation & Language research 2d ago

The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning

arXiv:2606.28843v1 Announce Type: new Abstract: Fine-tuning a large language model is a ubiquitous method for enhancing its capability on a specific downstream task. However, prior work has shown that this increase in capability comes with a cost: it can increase a model's…

18
arXiv — NLP / Computation & Language research 2d ago

A Hybrid Framework for Song Lyric Annotation Based on Human-LLM Alignment

arXiv:2606.29273v1 Announce Type: new Abstract: Emotion recognition of song lyrics is a challenging task since lyrics may not necessarily align with the overall emotion of a song. As a result, lyrics annotation remains largely underexplored. Drawing inspiration from research in…

34
arXiv — NLP / Computation & Language research 2d ago

Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages

arXiv:2606.29649v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or…

31
arXiv — NLP / Computation & Language research 2d ago

Timesteps of Mamba Align with Human Reading Times

arXiv:2606.29904v1 Announce Type: new Abstract: This study demonstrates an alignment of per-word processing time in a popular state-space language model Mamba and human readers. In Mamba, the recurrent state transition at each layer conceptually takes some duration of time, the…

12
arXiv — NLP / Computation & Language research 2d ago

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

arXiv:2606.29933v1 Announce Type: new Abstract: The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and…

16
arXiv — NLP / Computation & Language research 2d ago

Node-to-Neighborhood Semantic Consistency: Text-Topology Alignment for TAGs Anomaly Detection

arXiv:2606.30009v1 Announce Type: new Abstract: Graph anomaly detection (GAD) on text-attributed graphs (TAGs) is vital for applications such as fraud detection and academic integrity verification. Existing approaches generally fall into two paradigms. GNN-based methods…

36
Hugging Face Daily Papers research 2d ago

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

Abstract: SafePyramid benchmark evaluates guardrail systems' ability to identify safety violations through in-context policy specification across multiple domains and complexity levels. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) In real-world applications, guardrails are often…

5
arXiv — Machine Learning research 3d ago

RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance

arXiv:2606.27766v1 Announce Type: new Abstract: Offline reinforcement learning enables policy learning from fixed datasets without additional environment interaction, making it appealing for safety-critical applications where online exploration is costly or unsafe.…

32
arXiv — Machine Learning research 3d ago

NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning

arXiv:2606.27771v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of…

8
arXiv — Machine Learning research 3d ago

OperatorSHAP: Fast and Accurate Shapley Value Estimation for Neural Operators

arXiv:2606.28065v1 Announce Type: new Abstract: Understanding model predictions is essential for physical applications, where outputs often inform safety-critical decisions, such as structural load assessment, weather warnings, and clinical diagnosis. Shapley values satisfy many…

20
arXiv — Machine Learning research 3d ago

Democratic ICAI: Debating Our Way to Steering Principles from Preferences

arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the…

38
arXiv — NLP / Computation & Language research 3d ago

Position: The Term "Machine Unlearning" Is Overused in LLMs

arXiv:2606.27379v1 Announce Type: new Abstract: Large language models increasingly face demands to "forget" training data, knowledge, or behaviors due to regulatory deletion obligations, copyright/licensing disputes, and safety or product-policy requirements. This position paper…

15
