Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 28d ago

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

arXiv:2606.04978v1 Announce Type: new Abstract: LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a…

26
Hugging Face Daily Papers research 28d ago

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Abstract BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use agents extend language…

30
OpenAI official-blog 29d ago

OpenAI public policy agenda

OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure AI benefits society.

10
OpenAI official-blog 29d ago

A blueprint for democratic governance of frontier AI

OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security.

11
arXiv — Machine Learning research 29d ago

Assessing Region-Level EEG Contributions to Cognitive Workload Prediction

arXiv:2606.02598v1 Announce Type: new Abstract: Accurate and generalizable estimation of cognitive workload from electroencephalography (EEG) is critical for human-centered and safety-critical systems. Although EEG is widely used for workload assessment, the consistency of…

29
arXiv — Machine Learning research 29d ago

Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis

arXiv:2606.02671v1 Announce Type: new Abstract: Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation…

19
arXiv — Machine Learning research 29d ago

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation…

27
arXiv — Machine Learning research 29d ago

Libra: Efficient Resource Management for Agentic RL Post-Training

arXiv:2606.03077v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout…

23
arXiv — Machine Learning research 29d ago

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

arXiv:2606.03131v1 Announce Type: new Abstract: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench containing 13 reward-hacking patterns covering real…

15
arXiv — NLP / Computation & Language research 29d ago

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

arXiv:2606.03022v1 Announce Type: new Abstract: Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address…

14
arXiv — NLP / Computation & Language research 29d ago

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

arXiv:2606.03043v1 Announce Type: new Abstract: LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard…

18
arXiv — NLP / Computation & Language research 29d ago

Coherence Maximization Improves Pluralistic Alignment

arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these…

16
arXiv — NLP / Computation & Language research 29d ago

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

arXiv:2606.03165v1 Announce Type: new Abstract: The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking…

30
arXiv — NLP / Computation & Language research 29d ago

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

arXiv:2606.03648v1 Announce Type: new Abstract: Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and…

32
arXiv — NLP / Computation & Language research 29d ago

Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

arXiv:2606.03695v1 Announce Type: new Abstract: As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the…

19
arXiv — NLP / Computation & Language research 29d ago

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric…

19
arXiv — NLP / Computation & Language research 29d ago

Consistency Training Can Entrench Misalignment

arXiv:2606.03810v1 Announce Type: new Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly…

31
arXiv — NLP / Computation & Language research 29d ago

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated…

16
arXiv — NLP / Computation & Language research 29d ago

Quantifying Faithful Confidence Expression in Large Reasoning Models

arXiv:2606.03969v1 Announce Type: new Abstract: Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This…

35
Hugging Face Daily Papers research 1mo ago

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Learning…

17
Hugging Face Daily Papers research 1mo ago

TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

Abstract A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research Agents have shown strong…

4
Hugging Face Daily Papers research 1mo ago

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM…

25
r/MachineLearning community 1mo ago

Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R]

Third in a series of papers tracking learning rules vs. human fMRI (THINGS dataset, V1–IT, N=3 subjects). Previous finding: untrained CNNs match backprop at V1. This paper asks: when does training break that, and does the learning rule matter? Setup: RSA alignment measured at 8…

30
Hugging Face Daily Papers research 1mo ago

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Abstract Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation. AI-generated summary Physical AI systems increasingly map multimodal…

12
OpenAI official-blog 1mo ago

Advancing youth safety and opportunity through global leadership

OpenAI calls for global action on youth AI safety through a dedicated AI Safety Institute

4
arXiv — Machine Learning research 1mo ago

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

arXiv:2606.00341v1 Announce Type: new Abstract: As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work…

12
arXiv — Machine Learning research 1mo ago

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

arXiv:2606.00400v1 Announce Type: new Abstract: Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed…

11
arXiv — Machine Learning research 1mo ago

MESA: Improving MoE Safety Alignment via Decentralized Expertise

arXiv:2606.00651v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical…

36
arXiv — Machine Learning research 1mo ago

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

arXiv:2606.00686v1 Announce Type: new Abstract: The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach…

7
arXiv — NLP / Computation & Language research 1mo ago

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

arXiv:2606.00027v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a…

37
arXiv — NLP / Computation & Language research 1mo ago

RealityTest: How People Probe AI Identity and Whether Models Disclose It

arXiv:2606.00168v1 Announce Type: new Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of…

24
arXiv — NLP / Computation & Language research 1mo ago

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

arXiv:2606.00284v1 Announce Type: new Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around…

23
arXiv — NLP / Computation & Language research 1mo ago

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

arXiv:2606.00334v1 Announce Type: new Abstract: Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are…

21
arXiv — NLP / Computation & Language research 1mo ago

Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

arXiv:2606.00975v1 Announce Type: new Abstract: LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates…

8
arXiv — NLP / Computation & Language research 1mo ago

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

arXiv:2606.01060v1 Announce Type: new Abstract: Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and…

28
arXiv — NLP / Computation & Language research 1mo ago

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

arXiv:2606.01196v1 Announce Type: new Abstract: Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive…

34
arXiv — NLP / Computation & Language research 1mo ago

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

arXiv:2606.01276v1 Announce Type: new Abstract: Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to…

33
arXiv — NLP / Computation & Language research 1mo ago

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven…

19
Hugging Face Daily Papers research 1mo ago

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

Abstract Model-aware skill alignment framework adapts skills to different backbones through hierarchical evolution and lightweight rewriter training, achieving superior performance across interactive tasks. AI-generated summary LLM agents increasingly retrieve externally curated…

18
OpenAI official-blog 1mo ago

Our views on AI policy and political advocacy

Our approach to AI policy and political advocacy: transparency, support for thoughtful regulation and AI safety, and the position that no outside political group speaks on the company’s behalf.

26
The Information — AI news-outlet 1mo ago

Florida Sues OpenAI and Sam Altman Over Safety Concerns

Florida Attorney General James Uthmeier on Monday sued OpenAI and its chief executive Sam Altman, alleging 10 counts of negligence, liability, and other state law violations related to safety concerns over OpenAI’s consumer-facing tool ChatGPT. With the lawsuit, Florida became…

24
Hugging Face Daily Papers research 1mo ago

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Abstract SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives. AI-generated summary Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and…

26
arXiv — Machine Learning research 1mo ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern,…

24
arXiv — Machine Learning research 1mo ago

Calibrated Preference Learning: The Case of Label Ranking

arXiv:2605.30447v1 Announce Type: new Abstract: Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally…

20
arXiv — Machine Learning research 1mo ago

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

arXiv:2605.30526v1 Announce Type: new Abstract: Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies…

11
arXiv — Machine Learning research 1mo ago

Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

arXiv:2605.30556v1 Announce Type: new Abstract: Random, untrained neural networks consistently match or exceed trained networks in representational similarity to early visual cortex. This puzzling finding challenges the assumption that learning improves brain alignment. We…

22
arXiv — Machine Learning research 1mo ago

Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation

arXiv:2605.30585v1 Announce Type: new Abstract: Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. This paper investigates five major…

38
arXiv — Machine Learning research 1mo ago

CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment

arXiv:2605.30635v1 Announce Type: new Abstract: Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology. In scRNA-sequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making…

31
arXiv — Machine Learning research 1mo ago

CSULoRA: Closest Safe Update Low-Rank Adaptation

arXiv:2605.30640v1 Announce Type: new Abstract: Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned…

28
arXiv — Machine Learning research 1mo ago

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

arXiv:2605.30873v1 Announce Type: new Abstract: Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user…

35
