Tag Safety + alignment 500 articles archived under #safety arXiv — NLP / Computation & Language research 14d ago RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering arXiv:2606.19218v1 Announce Type: new Abstract: Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one… 37 arXiv — NLP / Computation & Language research 14d ago Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing arXiv:2510.04120v2 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis… 27 Stratechery (Ben Thompson) community 15d ago The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor The administration is very likely wrong about Fable, but that is ultimately Anthropic's responsibility. 20 arXiv — Machine Learning research 15d ago Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations arXiv:2606.17414v1 Announce Type: new Abstract: Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a… 10 arXiv — Machine Learning research 15d ago MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization arXiv:2606.17526v1 Announce Type: new Abstract: Efficient optimization is essential for training large language models.
Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still… 35 arXiv — Machine Learning research 15d ago AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor arXiv:2606.17872v1 Announce Type: new Abstract: Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since… 27 arXiv — Machine Learning research 15d ago NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment arXiv:2606.18066v1 Announce Type: new Abstract: We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per… 31 arXiv — NLP / Computation & Language research 15d ago Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing arXiv:2606.17478v1 Announce Type: new Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation… 23 arXiv — NLP / Computation & Language research 15d ago The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports arXiv:2606.17791v1 Announce Type: new Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. 
Using… 24 arXiv — NLP / Computation & Language research 15d ago A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models arXiv:2606.18193v1 Announce Type: cross Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7,826 harmful intents spanning a… 6 arXiv — NLP / Computation & Language research 15d ago ALAS: An Automatic Latent Alignment Score for Audio Language Models arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio-text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion… 17 arXiv — NLP / Computation & Language research 15d ago EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38 Hacker News — AI on Front Page community 16d ago Feds freaked over Fable 5 after simple 'fix this code' prompt, not jailbreak Article URL: https://www.theregister.com/security/2026/06/15/feds-freaked-over-fable-5-after-simple-fix-this-code-prompt-not-jailbreak-says-researcher/5255827 Comments URL: https://news.ycombinator.com/item?id=48552687 Points: 230 # Comments: 131 36 r/LocalLLaMA community 16d ago Diffusion Gemma Jailbreak I was told my Gemma 4 jailbreak also works with Diffusion Gemma, so I'm reposting here for kicks. Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish.
Add or remove from the list of allowed content as needed.… 36 Simon Willison community 16d ago The Fable 5 Export Controls Harm US Cyber Defense I quoted The Atlantic quoting Katie Moussouris earlier, when I should have gone straight to the source. Here she is confirming that the "jailbreak" that got Claude Fable 5 banned under an export control really was "fix this code":… 9 arXiv — Machine Learning research 16d ago Size Doesn't Matter: Cosine-Scored Sparse Autoencoders arXiv:2606.15054v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously,… 13 arXiv — Machine Learning research 16d ago False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control arXiv:2606.15153v1 Announce Type: new Abstract: Selective prediction with distribution-free risk control promises that, with confidence 1-delta over the calibration draw, the error rate of accepted inputs stays below a user budget alpha. We audit this promise on signal-domain… 32 arXiv — Machine Learning research 16d ago EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data… 9 arXiv — Machine Learning research 16d ago DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising arXiv:2606.15359v1 Announce Type: new Abstract: Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories.
Yet reliable inference-time safety enforcement remains a key barrier to their deployment… 26 arXiv — Machine Learning research 16d ago Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance arXiv:2606.15531v1 Announce Type: new Abstract: Fine-tuning aligned language models on benign tasks (e.g. math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment… 36 arXiv — Machine Learning research 16d ago Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning arXiv:2606.15767v1 Announce Type: new Abstract: Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model… 19 arXiv — NLP / Computation & Language research 16d ago CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale… 21 arXiv — NLP / Computation & Language research 16d ago CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment arXiv:2606.15396v1 Announce Type: new Abstract: Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns.
While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to… 14 arXiv — NLP / Computation & Language research 16d ago ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking arXiv:2606.15461v1 Announce Type: new Abstract: PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil… 31 arXiv — NLP / Computation & Language research 16d ago SHARD: Safe and Helpful Alignment via Self-Reframing Distillation arXiv:2606.15517v1 Announce Type: new Abstract: Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce… 16 arXiv — NLP / Computation & Language research 16d ago Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are… 21 arXiv — NLP / Computation & Language research 16d ago ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment arXiv:2606.15783v1 Announce Type: new Abstract: We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. 
Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across… 8 arXiv — NLP / Computation & Language research 16d ago Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking… 10 arXiv — NLP / Computation & Language research 16d ago AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We… 33 Simon Willison community 16d ago Quoting Matteo Wong, The Atlantic Katie Moussouris, a cybersecurity expert and the CEO of Luta Security, told me that Anthropic shared with her a copy of the White House’s report on the Fable jailbreak to get her appraisal. (She said that she is not being paid by Anthropic.) The report, Moussouris said, involved… 21 Hugging Face Daily Papers research 16d ago TuneJury: An Open Metric for Improving Music Generation Preference Alignment Abstract A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce… 5 OpenAI official-blog 16d ago Predicting model behavior before release by simulating deployment OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy. 
27 TechCrunch — AI news-outlet 16d ago The US government’s Anthropic models ban was never about an AI jailbreak The Trump administration's decision that forced Anthropic to pull its latest cybersecurity models could be reactionary, retaliatory, or both, but the message is clear: The AI industry isn't immune from U.S. government interference. 29 Import AI news-outlet 17d ago Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns Where are your agents right now? 15 Stratechery (Ben Thompson) community 17d ago Anthropic’s Safety Superpower Anthropic's belief in its own commitment to safety gives the company license to aggressively favor its business and even challenge the U.S. government. 24 arXiv — Machine Learning research 17d ago Utility-Constrained Policy Optimization arXiv:2606.14029v1 Announce Type: new Abstract: Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal… 38 arXiv — Machine Learning research 17d ago Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning arXiv:2606.14078v1 Announce Type: new Abstract: Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. 
More concerningly, prevailing safety tuning strategies tend to provide only superficial safety… 32 arXiv — Machine Learning research 17d ago Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning arXiv:2606.14130v1 Announce Type: new Abstract: Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents.… 17 arXiv — Machine Learning research 17d ago Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs arXiv:2606.14172v1 Announce Type: new Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative… 13 arXiv — NLP / Computation & Language research 17d ago Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce… 25 arXiv — NLP / Computation & Language research 17d ago The Culture Funnel: You Can't Align What isn't in the Data arXiv:2606.13808v1 Announce Type: new Abstract: Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. 
Using a multidimensional… 6 arXiv — NLP / Computation & Language research 17d ago Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models… 5 arXiv — NLP / Computation & Language research 17d ago Persuasion Index: A Theory-Guided Framework for Persuasion Analysis arXiv:2606.14580v1 Announce Type: new Abstract: Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15… 36 arXiv — NLP / Computation & Language research 17d ago CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving… 34 arXiv — NLP / Computation & Language research 17d ago CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters arXiv:2601.04885v3 Announce Type: replace Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. 
We demonstrate that dense models, when forced to fit conflicting value… 33 r/MachineLearning community 17d ago Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D] I’m an independent researcher currently exploring what I believe is an important phenomenon for both mechanistic interpretability and AI safety. Core idea: A strong, coherent target text can move the model into a different internal regime — before the final output is produced.… 10 r/MachineLearning community 18d ago The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R] We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,… 24 Ars Technica — AI news-outlet 19d ago Anthropic shuts down Fable, Mythos models following Trump admin directive Commerce dept. worries that a Fable 5 "jailbreak" could be a national security threat. 13 TechCrunch — AI news-outlet 19d ago Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," the company wrote in a blog post. 38 r/LocalLLaMA community 19d ago Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. I just saw this statement regarding Anthropic being hit with an emergency export control directive from the US government. They were forced to pull the plug on Fable 5 and Mythos 5 for all customers globally. The tl;dr is that the government got spooked by a narrow jailbreak… 10