Tag

Safety + alignment

500 articles archived under #safety

TechCrunch — AI news-outlet 19d ago

Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI

Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," the company wrote in a blog post.

38
r/LocalLLaMA community 19d ago

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models.

I just saw this statement regarding Anthropic being hit with an emergency export control directive from the US government. They were forced to pull the plug on Fable 5 and Mythos 5 for all customers globally. The tl;dr is that the government got spooked by a narrow jailbreak…

10
Hugging Face Daily Papers research 19d ago

The Cold-Start Safety Gap in LLM Agents

Abstract: Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Are tool-calling LLM agents equally safe…

37
Hugging Face Daily Papers research 20d ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract: Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Adversarial robustness evaluations of large…

6
Hugging Face Daily Papers research 20d ago

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Abstract: Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Despite generating…

22
Hugging Face Daily Papers research 20d ago

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Abstract: Representation autoencoders using deep learning frameworks can improve image reconstruction quality by combining shallow and deep visual feature representations for better semantic richness and visual fidelity. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Built on…

31
arXiv — NLP / Computation & Language research 20d ago

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

arXiv:2606.12897v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG)…

29
arXiv — NLP / Computation & Language research 20d ago

PolyAlign: Conditional Human-Distribution Alignment

arXiv:2606.13227v1 Announce Type: new Abstract: Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress…

29
arXiv — NLP / Computation & Language research 20d ago

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech…

30
arXiv — NLP / Computation & Language research 20d ago

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

arXiv:2606.12426v1 Announce Type: cross Abstract: LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B…

10
arXiv — NLP / Computation & Language research 20d ago

Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication

arXiv:2606.12433v1 Announce Type: cross Abstract: Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status.…

5
arXiv — NLP / Computation & Language research 20d ago

Order Is Not Control

arXiv:2606.12923v1 Announce Type: cross Abstract: AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping…

19
Hugging Face Daily Papers research 20d ago

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Abstract: Token-subset representation alignment method called MaskAlign improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment behavior under perturbations. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Representation…

12
MIT Technology Review — AI news-outlet 21d ago

Google DeepMind is worried about what happens when millions of agents start to interact

Google DeepMind is funding research into the potential dangers of situations where millions of different AI agents interact with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can…

35
Hugging Face Daily Papers research 21d ago

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Abstract Grammar-constrained decoding techniques used to ensure syntactic validity in code generation can be exploited as an attack surface, leading to the development of a jailbreak method called CodeSpear and a safety alignment approach named CodeShield. Generated by…

37
arXiv — NLP / Computation & Language research 21d ago

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

arXiv:2606.11201v1 Announce Type: cross Abstract: The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes…

38
arXiv — Machine Learning research 21d ago

Beyond representational alignment with brain-guided language models for robust reasoning

arXiv:2606.11893v1 Announce Type: new Abstract: The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear…

31
arXiv — Machine Learning research 21d ago

Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

arXiv:2606.11949v1 Announce Type: new Abstract: We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal…

22
arXiv — NLP / Computation & Language research 21d ago

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

arXiv:2606.11202v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability,…

38
arXiv — NLP / Computation & Language research 21d ago

Benchmarking Large Language Models for Safety Data Extraction

arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks…

27
arXiv — NLP / Computation & Language research 21d ago

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

arXiv:2606.11316v1 Announce Type: new Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing…

17
arXiv — NLP / Computation & Language research 21d ago

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,…

16
arXiv — NLP / Computation & Language research 21d ago

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in…

20
arXiv — NLP / Computation & Language research 21d ago

SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

arXiv:2606.11512v1 Announce Type: new Abstract: Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional…

13
arXiv — NLP / Computation & Language research 21d ago

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

arXiv:2606.12342v1 Announce Type: new Abstract: Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model…

18
TechCrunch — AI news-outlet 21d ago

xAI fired an engineer who raised alarms about Grok safety, new lawsuit claims

A former xAI engineer is suing the company and SpaceX, alleging he was fired for raising AI safety concerns about Grok days before SpaceX's historic IPO.

18
Hugging Face Daily Papers research 21d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by…

38
r/MachineLearning community 22d ago

[R] AI Agent Security: The Complete Guide to Threats, Defenses, and the Future of Autonomous AI Safety

This is a comprehensive living reference guide to AI agent security — synthesizing 18 articles from The Agent Report covering the 75-day period (April–June 2026) when agent security went from theoretical concern to operational crisis.  What's inside:  • Incident…

4
Hugging Face Daily Papers research 22d ago

The Role of Feedback Alignment in Self-Distillation

Abstract: Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

32
Google DeepMind official-blog 22d ago

Investing in multi-agent AI safety research

Google DeepMind and partners announce a $10M funding call for multi-agent safety research.

27
Stratechery (Ben Thompson) community 22d ago

Fable 5, Anthropic Alignment, AI Tiers

Fable 5 is the public version of Mythos, and while it is very capable it sets some troubling new precedents.

25
arXiv — NLP / Computation & Language research 22d ago

Mechanistic Analysis of Alignment Algorithms in Language Models

arXiv:2606.09850v1 Announce Type: cross Abstract: Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization…

22
arXiv — Machine Learning research 22d ago

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,…

23
arXiv — Machine Learning research 22d ago

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

arXiv:2606.09866v1 Announce Type: new Abstract: Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our…

28
arXiv — NLP / Computation & Language research 22d ago

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

arXiv:2606.09890v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior…

17
arXiv — Machine Learning research 22d ago

Quality Is Not a Safety Proxy Under Quantization

arXiv:2606.10154v1 Announce Type: new Abstract: Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF…

37
arXiv — Machine Learning research 22d ago

A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport

arXiv:2606.10216v1 Announce Type: new Abstract: Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These…

12
arXiv — Machine Learning research 22d ago

Alignment Defends LLMs from Property Inference Attacks

arXiv:2606.10217v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted…

18
arXiv — Machine Learning research 22d ago

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

arXiv:2606.10228v1 Announce Type: new Abstract: Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to…

15
arXiv — NLP / Computation & Language research 22d ago

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

arXiv:2606.10061v1 Announce Type: new Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research…

27
arXiv — NLP / Computation & Language research 22d ago

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

arXiv:2606.10126v1 Announce Type: new Abstract: Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained…

31
arXiv — NLP / Computation & Language research 22d ago

Hidden Consensus: Preference-Validity Compression in Human Feedback

arXiv:2606.10569v1 Announce Type: new Abstract: Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect…

7
arXiv — NLP / Computation & Language research 22d ago

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

arXiv:2606.10675v1 Announce Type: new Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech…

34
arXiv — NLP / Computation & Language research 22d ago

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the…

18
arXiv — NLP / Computation & Language research 22d ago

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

arXiv:2606.11167v1 Announce Type: new Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level…

23
arXiv — NLP / Computation & Language research 22d ago

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability…

29
arXiv — NLP / Computation & Language research 22d ago

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

arXiv:2606.10461v1 Announce Type: cross Abstract: Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown…

8
Hugging Face Daily Papers research 22d ago

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated…

12
Hugging Face Daily Papers research 22d ago

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Abstract Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities. Generated by…

24
Hugging Face Daily Papers research 22d ago

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

Abstract: Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

14
