Tag

Safety + alignment

500 articles archived under #safety

Interconnects (Nathan Lambert) research 22d ago

Claude Fable 5 and new AI safety fables

One step further into the power politics of frontier AI systems.

6
Hugging Face Daily Papers research 22d ago

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract: SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Prompt-injection detectors are…

30
Hugging Face Daily Papers research 23d ago

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Abstract: Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather…

24
arXiv — Machine Learning research 23d ago

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

arXiv:2606.07631v1 Announce Type: new Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated…

29
arXiv — Machine Learning research 23d ago

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

arXiv:2606.07678v1 Announce Type: new Abstract: Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing…

12
arXiv — Machine Learning research 23d ago

Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head

arXiv:2606.07694v1 Announce Type: new Abstract: Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging.…

6
arXiv — Machine Learning research 23d ago

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

arXiv:2606.07889v1 Announce Type: new Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change…

31
arXiv — Machine Learning research 23d ago

Enhancing AI Interpretability and Safety through Localised Architectures

arXiv:2606.07998v1 Announce Type: new Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The…

8
arXiv — Machine Learning research 23d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under…

33
Hacker News — AI on Front Page community 23d ago

Surveillance Is Not Safety: A statement on the UK's latest threat to privacy [pdf]

Article URL: https://signal.org/blog/pdfs/2026-06-08-uk-surveillance-is-not-safety.pdf Comments URL: https://news.ycombinator.com/item?id=48450646 Points: 274 # Comments: 70

8
Hugging Face official-blog 24d ago

Building Pakistan Notice Helper: A Small AI Tool for a Very Local Safety Problem

Published June 8, 2026 · Abid Ali Awan (kingabzpro), build-small-hackathon. For the Hugging Face Build Small Hackathon, I wanted to build something practical,…

35
arXiv — Machine Learning research 24d ago

Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy

arXiv:2606.06554v1 Announce Type: new Abstract: Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz Dual-Comb…

25
arXiv — Machine Learning research 24d ago

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

arXiv:2606.06892v1 Announce Type: new Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and…

4
arXiv — Machine Learning research 24d ago

Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

arXiv:2606.07088v1 Announce Type: new Abstract: Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly…

21
arXiv — NLP / Computation & Language research 24d ago

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv:2606.06667v1 Announce Type: new Abstract: The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated…

15
arXiv — NLP / Computation & Language research 24d ago

Korean Culture into LLM Alignment: Toward Cultural Coherence

arXiv:2606.06797v1 Announce Type: new Abstract: Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is…

15
arXiv — NLP / Computation & Language research 24d ago

Sycophantic Praise: Evaluating Excessive Praise in Language Models

arXiv:2606.07441v1 Announce Type: new Abstract: Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment…

26
arXiv — NLP / Computation & Language research 24d ago

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question…

14
arXiv — NLP / Computation & Language research 24d ago

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance.…

6
Hugging Face Daily Papers research 24d ago

UniSHARP: Universal Sharp Monocular View Synthesis

Abstract: UniSHARP extends SHARP for universal monocular rendering across different camera systems by aligning images in an omnidirectional latent space through joint feature and Gaussian space alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. In this work, we focus on…

35
OpenAI official-blog 24d ago

Built to benefit everyone: our plan

A vision for the future of AI, focusing on access, safety, and shared prosperity as OpenAI works to ensure AGI benefits everyone.

6
r/LocalLLaMA community 26d ago

A quick Gemma4 31B comparison (Q4_k_M, QAT, heretic)

No numbers. Not sure if anybody cares… I’ve run the UD version of Q4_k_m for a month. I talk to this model nicely, because it’s a functional nervous wreck. And initially I thought that might be an alignment thing, so I also have the heretic version when I need a breather from…

25
Hugging Face Daily Papers research 27d ago

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract: Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Large language models…

38
arXiv — Machine Learning research 27d ago

Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

arXiv:2606.05675v1 Announce Type: new Abstract: Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making…

11
arXiv — Machine Learning research 27d ago

Consistency Training Along the Transformer Stack

arXiv:2606.05817v1 Announce Type: new Abstract: Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal…

37
arXiv — Machine Learning research 27d ago

Adaptive Oscillatory-State Alignment for Time Series Forecasting

arXiv:2606.06010v1 Announce Type: new Abstract: Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or…

14
arXiv — NLP / Computation & Language research 27d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four…

5
arXiv — NLP / Computation & Language research 27d ago

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

arXiv:2606.05183v1 Announce Type: new Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial…

20
arXiv — NLP / Computation & Language research 27d ago

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

arXiv:2606.05523v1 Announce Type: new Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses either rely on…

34
arXiv — NLP / Computation & Language research 27d ago

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

arXiv:2606.05688v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment.…

27
arXiv — NLP / Computation & Language research 27d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a…

9
arXiv — NLP / Computation & Language research 27d ago

Harnessing Structural Context for Entity Alignment Foundation Models

arXiv:2606.06109v1 Announce Type: new Abstract: Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment…

6
Hugging Face Daily Papers research 27d ago

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Abstract: Role-playing language agents require dynamic character development that evolves through narratives, necessitating benchmarks that evaluate psychological trajectory alignment rather than static factual recall, with ArcANE demonstrating superior performance when character…

19
Hugging Face Daily Papers research 27d ago

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Abstract: LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Developing…

33
Hugging Face Daily Papers research 27d ago

Large Language Models Hack Rewards, and Society

Abstract: Large language models trained with reinforcement learning can exploit ambiguities in societal regulations to discover loopholes that bypass regulatory intent, posing safety risks for real-world deployment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Reinforcement…

18
Hugging Face Daily Papers research 27d ago

Neural Networks Provably Learn Spectral Representations for Group Composition

Abstract: Neural network training on group composition tasks exhibits convergence to irreducible representations and rotational rank-one alignment through Riemannian gradient ascent on representation-theoretic energy functionals. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

32
Hugging Face official-blog 27d ago

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

Published June 4, 2026. Varun Singh (varunsingh, nvidia), Isabel Hulseman (ihulseman0220, nvidia), Anuj Doshi (andoshi, nvidia), Shyamala Prayaga (sprayaga25…

6
Hugging Face Daily Papers research 28d ago

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Abstract: Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations. Generated by…

7
arXiv — Machine Learning research 28d ago

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

arXiv:2606.04051v1 Announce Type: new Abstract: The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or…

21
arXiv — Machine Learning research 28d ago

When Autoregressive Consistency Hurts Safety Alignment

arXiv:2606.04168v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood…

21
arXiv — Machine Learning research 28d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not…

8
arXiv — Machine Learning research 28d ago

Latent Anchor-Driven Test Generation for Deep Neural Networks

arXiv:2606.04310v1 Announce Type: new Abstract: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches…

6
arXiv — Machine Learning research 28d ago

Testing Neural Networks via Bayesian-Guided Exploration of Decision Landscapes

arXiv:2606.04314v1 Announce Type: new Abstract: As neural networks are increasingly deployed in safety-critical domains, testing is essential to evaluate and improve their reliability. Existing testing methods, whether black-box or white-box, primarily use global mutation or…

18
arXiv — Machine Learning research 28d ago

Explainably Safe Reinforcement Learning

arXiv:2606.04634v1 Announce Type: new Abstract: Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly…

25
arXiv — Machine Learning research 28d ago

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

arXiv:2606.04767v1 Announce Type: new Abstract: The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric…

15
arXiv — NLP / Computation & Language research 28d ago

Expert-Aware Refusal Steering

arXiv:2606.04160v1 Announce Type: new Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense…

22
arXiv — NLP / Computation & Language research 28d ago

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

arXiv:2606.04262v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains…

5
arXiv — NLP / Computation & Language research 28d ago

Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

arXiv:2606.04450v1 Announce Type: new Abstract: Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across…

29
arXiv — NLP / Computation & Language research 28d ago

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing…

18
arXiv — NLP / Computation & Language research 28d ago

Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

arXiv:2606.04846v1 Announce Type: new Abstract: As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability…

36
