Safety + alignment
500 articles archived under #safety

TechCrunch — AI news-outlet 19d ago Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," the company wrote in a blog post. 38
r/LocalLLaMA community 19d ago Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. I just saw this statement regarding Anthropic being hit with an emergency export control directive from the US government. They were forced to pull the plug on Fable 5 and Mythos 5 for all customers globally. The tl;dr is that the government got spooked by a narrow jailbreak… 10
Hugging Face Daily Papers research 19d ago The Cold-Start Safety Gap in LLM Agents Abstract Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Are tool-calling LLM agents equally safe… 37
Hugging Face Daily Papers research 20d ago Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models Abstract Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Adversarial robustness evaluations of large… 6
Hugging Face Daily Papers research 20d ago Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback Abstract Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite generating… 22
Hugging Face Daily Papers research 20d ago IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder Abstract Representation autoencoders using deep learning frameworks can improve image reconstruction quality by combining shallow and deep visual feature representations for better semantic richness and visual fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Built on… 31
arXiv — NLP / Computation & Language research 20d ago SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings arXiv:2606.12897v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG)… 29
arXiv — NLP / Computation & Language research 20d ago PolyAlign: Conditional Human-Distribution Alignment arXiv:2606.13227v1 Announce Type: new Abstract: Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior.
While effective for improving average helpfulness, this can suppress… 29
arXiv — NLP / Computation & Language research 20d ago Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech… 30
arXiv — NLP / Computation & Language research 20d ago Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science arXiv:2606.12426v1 Announce Type: cross Abstract: LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B… 10
arXiv — NLP / Computation & Language research 20d ago Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication arXiv:2606.12433v1 Announce Type: cross Abstract: Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status.… 5
arXiv — NLP / Computation & Language research 20d ago Order Is Not Control arXiv:2606.12923v1 Announce Type: cross Abstract: AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control.
Control requires a receiver-gated response law: a denominator-indexed operator mapping… 19
Hugging Face Daily Papers research 20d ago MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training Abstract Token-subset representation alignment method called MaskAlign improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment behavior under perturbations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation… 12
MIT Technology Review — AI news-outlet 21d ago Google DeepMind is worried about what happens when millions of agents start to interact Google DeepMind is funding research into the potential dangers of situations where millions of different AI agents interact with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can… 35
Hugging Face Daily Papers research 21d ago Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code Abstract Grammar-constrained decoding techniques used to ensure syntactic validity in code generation can be exploited as an attack surface, leading to the development of a jailbreak method called CodeSpear and a safety alignment approach named CodeShield. Generated by… 37
arXiv — NLP / Computation & Language research 21d ago To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending arXiv:2606.11201v1 Announce Type: cross Abstract: The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions.
Among different methods, inference-time alignment is often cheaper as it intervenes… 38
arXiv — Machine Learning research 21d ago Beyond representational alignment with brain-guided language models for robust reasoning arXiv:2606.11893v1 Announce Type: new Abstract: The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear… 31
arXiv — Machine Learning research 21d ago Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers arXiv:2606.11949v1 Announce Type: new Abstract: We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal… 22
arXiv — NLP / Computation & Language research 21d ago One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection arXiv:2606.11202v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability,… 38
arXiv — NLP / Computation & Language research 21d ago Benchmarking Large Language Models for Safety Data Extraction arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods.
This study benchmarks… 27
arXiv — NLP / Computation & Language research 21d ago Schützen: Evaluating LLM Safety in Bulgarian and German Contexts arXiv:2606.11316v1 Announce Type: new Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing… 17
arXiv — NLP / Computation & Language research 21d ago Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,… 16
arXiv — NLP / Computation & Language research 21d ago Agent Skill Evaluation and Evolution: Frameworks and Benchmarks arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20
arXiv — NLP / Computation & Language research 21d ago SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment arXiv:2606.11512v1 Announce Type: new Abstract: Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional… 13
arXiv — NLP / Computation & Language research 21d ago ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing arXiv:2606.12342v1 Announce Type: new Abstract: Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language.
Existing inference-time defenses that mix logits from a safe anchor model… 18
TechCrunch — AI news-outlet 21d ago xAI fired an engineer who raised alarms about Grok safety, new lawsuit claims A former xAI engineer is suing the company and SpaceX, alleging he was fired for raising AI safety concerns about Grok days before SpaceX's historic IPO. 18
Hugging Face Daily Papers research 21d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by… 38
r/MachineLearning community 22d ago [R] AI Agent Security: The Complete Guide to Threats, Defenses, and the Future of Autonomous AI Safety [R] This is a comprehensive living reference guide to AI agent security — synthesizing 18 articles from The Agent Report covering the 75-day period (April–June 2026) when agent security went from theoretical concern to operational crisis. What's inside: • Incident… 4
Hugging Face Daily Papers research 22d ago The Role of Feedback Alignment in Self-Distillation Abstract Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 32
Google DeepMind official-blog 22d ago Investing in multi-agent AI safety research Google DeepMind and partners announce a $10M funding call for multi-agent safety research. 27
Stratechery (Ben Thompson) community 22d ago Fable 5, Anthropic Alignment, AI Tiers Fable 5 is the public version of Mythos, and while it is very capable it sets some troubling new precedents.
25
arXiv — NLP / Computation & Language research 22d ago Mechanistic Analysis of Alignment Algorithms in Language Models arXiv:2606.09850v1 Announce Type: cross Abstract: Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization… 22
arXiv — Machine Learning research 22d ago Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,… 23
arXiv — Machine Learning research 22d ago Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning arXiv:2606.09866v1 Announce Type: new Abstract: Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our… 28
arXiv — NLP / Computation & Language research 22d ago PreAct-Bench: Benchmarking Predictive Monitoring in LLMs arXiv:2606.09890v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior… 17
arXiv — Machine Learning research 22d ago Quality Is Not a Safety Proxy Under Quantization arXiv:2606.10154v1 Announce Type: new Abstract: Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests.
This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF… 37
arXiv — Machine Learning research 22d ago A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport arXiv:2606.10216v1 Announce Type: new Abstract: Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These… 12
arXiv — Machine Learning research 22d ago Alignment Defends LLMs from Property Inference Attacks arXiv:2606.10217v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted… 18
arXiv — Machine Learning research 22d ago SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration arXiv:2606.10228v1 Announce Type: new Abstract: Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to… 15
arXiv — NLP / Computation & Language research 22d ago BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts arXiv:2606.10061v1 Announce Type: new Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment.
Existing sycophancy research… 27
arXiv — NLP / Computation & Language research 22d ago Pareto-Guided Teacher Alignment for Fair Personalized Text Generation arXiv:2606.10126v1 Announce Type: new Abstract: Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained… 31
arXiv — NLP / Computation & Language research 22d ago Hidden Consensus: Preference-Validity Compression in Human Feedback arXiv:2606.10569v1 Announce Type: new Abstract: Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect… 7
arXiv — NLP / Computation & Language research 22d ago Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming arXiv:2606.10675v1 Announce Type: new Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech… 34
arXiv — NLP / Computation & Language research 22d ago Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance.
This conversion is usually optimized for reasoning accuracy, without explicitly preserving the… 18
arXiv — NLP / Computation & Language research 22d ago Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models arXiv:2606.11167v1 Announce Type: new Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level… 23
arXiv — NLP / Computation & Language research 22d ago SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability… 29
arXiv — NLP / Computation & Language research 22d ago ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs arXiv:2606.10461v1 Announce Type: cross Abstract: Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown… 8
Hugging Face Daily Papers research 22d ago When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures.
Generated… 12
Hugging Face Daily Papers research 22d ago Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating Abstract Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities. Generated by… 24
Hugging Face Daily Papers research 22d ago BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts Abstract Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 14