Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 14d ago

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

arXiv:2606.19218v1 Announce Type: new Abstract: Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one…

37
arXiv — NLP / Computation & Language research 14d ago

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

arXiv:2510.04120v2 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis…

27
Stratechery (Ben Thompson) community 15d ago

The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor

The administration is very likely wrong about Fable, but that is ultimately Anthropic's responsibility.

20
arXiv — Machine Learning research 15d ago

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

arXiv:2606.17414v1 Announce Type: new Abstract: Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a…

10
arXiv — Machine Learning research 15d ago

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

arXiv:2606.17526v1 Announce Type: new Abstract: Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still…

35
arXiv — Machine Learning research 15d ago

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

arXiv:2606.17872v1 Announce Type: new Abstract: Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since…

27
arXiv — Machine Learning research 15d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

arXiv:2606.18066v1 Announce Type: new Abstract: We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per…

31
arXiv — NLP / Computation & Language research 15d ago

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv:2606.17478v1 Announce Type: new Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation…

23
arXiv — NLP / Computation & Language research 15d ago

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

arXiv:2606.17791v1 Announce Type: new Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using…

24
arXiv — NLP / Computation & Language research 15d ago

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

arXiv:2606.18193v1 Announce Type: cross Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a…

6
arXiv — NLP / Computation & Language research 15d ago

ALAS: An Automatic Latent Alignment Score for Audio Language Models

arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio–text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion…

17
arXiv — NLP / Computation & Language research 15d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
Hacker News — AI on Front Page community 16d ago

Feds freaked over Fable 5 after simple 'fix this code' prompt, not jailbreak

Article URL: https://www.theregister.com/security/2026/06/15/feds-freaked-over-fable-5-after-simple-fix-this-code-prompt-not-jailbreak-says-researcher/5255827 · Comments URL: https://news.ycombinator.com/item?id=48552687 · Points: 230 · Comments: 131

36
r/LocalLLaMA community 16d ago

Diffusion Gemma Jailbreak

I was told my Gemma 4 jailbreak also works with Diffusion Gemma, so I'm reposting here for kicks. Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.…

36
Simon Willison community 16d ago

The Fable 5 Export Controls Harm US Cyber Defense

I quoted The Atlantic quoting Katie Moussouris earlier, when I should have gone straight to the source. Here she is confirming that the "jailbreak" that got Claude Fable 5 banned under an export control really was "fix this code":…

9
arXiv — Machine Learning research 16d ago

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

arXiv:2606.15054v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously,…

13
arXiv — Machine Learning research 16d ago

False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control

arXiv:2606.15153v1 Announce Type: new Abstract: Selective prediction with distribution-free risk control promises that, with confidence 1-delta over the calibration draw, the error rate of accepted inputs stays below a user budget alpha. We audit this promise on signal-domain…

32
arXiv — Machine Learning research 16d ago

EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction

arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data…

9
arXiv — Machine Learning research 16d ago

DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising

arXiv:2606.15359v1 Announce Type: new Abstract: Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories. Yet reliable inference-time safety enforcement remains a key barrier to their deployment…

26
arXiv — Machine Learning research 16d ago

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

arXiv:2606.15531v1 Announce Type: new Abstract: Fine-tuning aligned language models on benign tasks (e.g. math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment…

36
arXiv — Machine Learning research 16d ago

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

arXiv:2606.15767v1 Announce Type: new Abstract: Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model…

19
arXiv — NLP / Computation & Language research 16d ago

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence–rationale…

21
arXiv — NLP / Computation & Language research 16d ago

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

arXiv:2606.15396v1 Announce Type: new Abstract: Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to…

14
arXiv — NLP / Computation & Language research 16d ago

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

arXiv:2606.15461v1 Announce Type: new Abstract: PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil…

31
arXiv — NLP / Computation & Language research 16d ago

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

arXiv:2606.15517v1 Announce Type: new Abstract: Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce…

16
arXiv — NLP / Computation & Language research 16d ago

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are…

21
arXiv — NLP / Computation & Language research 16d ago

ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

arXiv:2606.15783v1 Announce Type: new Abstract: We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across…

8
arXiv — NLP / Computation & Language research 16d ago

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking…

10
arXiv — NLP / Computation & Language research 16d ago

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasingly central role of large language models in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We…

33
Simon Willison community 16d ago

Quoting Matteo Wong, The Atlantic

Katie Moussouris, a cybersecurity expert and the CEO of Luta Security, told me that Anthropic shared with her a copy of the White House’s report on the Fable jailbreak to get her appraisal. (She said that she is not being paid by Anthropic.) The report, Moussouris said, involved…

21
Hugging Face Daily Papers research 16d ago

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Summary (generated by Qwen/Qwen2.5-Coder-32B-Instruct): A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. Abstract: We introduce…

5
OpenAI official-blog 16d ago

Predicting model behavior before release by simulating deployment

OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.

27
TechCrunch — AI news-outlet 16d ago

The US government’s Anthropic models ban was never about an AI jailbreak

The Trump administration's decision that forced Anthropic to pull its latest cybersecurity models could be reactionary, retaliatory, or both, but the message is clear: The AI industry isn't immune from U.S. government interference.

29
Import AI news-outlet 17d ago

Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns

Where are your agents right now?

15
Stratechery (Ben Thompson) community 17d ago

Anthropic’s Safety Superpower

Anthropic's belief in its own commitment to safety gives the company license to aggressively favor its business and even challenge the U.S. government.

24
arXiv — Machine Learning research 17d ago

Utility-Constrained Policy Optimization

arXiv:2606.14029v1 Announce Type: new Abstract: Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal…

38
arXiv — Machine Learning research 17d ago

Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning

arXiv:2606.14078v1 Announce Type: new Abstract: Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety…

32
arXiv — Machine Learning research 17d ago

Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

arXiv:2606.14130v1 Announce Type: new Abstract: Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents.…

17
arXiv — Machine Learning research 17d ago

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

arXiv:2606.14172v1 Announce Type: new Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative…

13
arXiv — NLP / Computation & Language research 17d ago

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce…

25
arXiv — NLP / Computation & Language research 17d ago

The Culture Funnel: You Can't Align What isn't in the Data

arXiv:2606.13808v1 Announce Type: new Abstract: Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional…

6
arXiv — NLP / Computation & Language research 17d ago

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models…

5
arXiv — NLP / Computation & Language research 17d ago

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

arXiv:2606.14580v1 Announce Type: new Abstract: Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15…

36
arXiv — NLP / Computation & Language research 17d ago

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving…

34
arXiv — NLP / Computation & Language research 17d ago

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

arXiv:2601.04885v3 Announce Type: replace Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value…

33
r/MachineLearning community 17d ago

Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

I’m an independent researcher currently exploring what I believe is an important phenomenon for both mechanistic interpretability and AI safety. Core idea: A strong, coherent target text can move the model into a different internal regime — before the final output is produced.…

10
r/MachineLearning community 18d ago

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,

24
Ars Technica — AI news-outlet 19d ago

Anthropic shuts down Fable, Mythos models following Trump admin directive

Commerce dept. worries that a Fable 5 "jailbreak" could be a national security threat.

13
TechCrunch — AI news-outlet 19d ago

Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI

Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," the company wrote in a blog post.

38
r/LocalLLaMA community 19d ago

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models.

I just saw this statement regarding Anthropic being hit with an emergency export control directive from the US government. They were forced to pull the plug on Fable 5 and Mythos 5 for all customers globally. The tl;dr is that the government got spooked by a narrow jailbreak…

10
