Tag: Safety + alignment · 500 articles archived under #safety
arXiv — NLP / Computation & Language research 28d ago Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game arXiv:2606.04978v1 Announce Type: new Abstract: LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a… 26
Hugging Face Daily Papers research 28d ago BraveGuard: From Open-World Threats to Safer Computer-Use Agents Abstract BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use agents extend language… 30
OpenAI official-blog 29d ago OpenAI public policy agenda OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure AI benefits society. 10
OpenAI official-blog 29d ago A blueprint for democratic governance of frontier AI OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security. 11
arXiv — Machine Learning research 29d ago Assessing Region-Level EEG Contributions to Cognitive Workload Prediction arXiv:2606.02598v1 Announce Type: new Abstract: Accurate and generalizable estimation of cognitive workload from electroencephalography (EEG) is critical for human-centered and safety-critical systems.
Although EEG is widely used for workload assessment, the consistency of… 29
arXiv — Machine Learning research 29d ago Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis arXiv:2606.02671v1 Announce Type: new Abstract: Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation… 19
arXiv — Machine Learning research 29d ago Gate AI: LLM Security Benchmark Evaluation Methodology and Results arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation… 27
arXiv — Machine Learning research 29d ago Libra: Efficient Resource Management for Agentic RL Post-Training arXiv:2606.03077v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout… 23
arXiv — Machine Learning research 29d ago HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models arXiv:2606.03131v1 Announce Type: new Abstract: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking.
To evaluate reward-model robustness, we introduce RewardHackBench containing 13 reward-hacking patterns covering real… 15
arXiv — NLP / Computation & Language research 29d ago Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization arXiv:2606.03022v1 Announce Type: new Abstract: Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints -- remains a persistent challenge for reliable deployment. In this work, we address… 14
arXiv — NLP / Computation & Language research 29d ago The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment arXiv:2606.03043v1 Announce Type: new Abstract: LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard… 18
arXiv — NLP / Computation & Language research 29d ago Coherence Maximization Improves Pluralistic Alignment arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these… 16
arXiv — NLP / Computation & Language research 29d ago Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models arXiv:2606.03165v1 Announce Type: new Abstract: The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment).
Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking… 30
arXiv — NLP / Computation & Language research 29d ago Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability arXiv:2606.03648v1 Announce Type: new Abstract: Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and… 32
arXiv — NLP / Computation & Language research 29d ago Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings arXiv:2606.03695v1 Announce Type: new Abstract: As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the… 19
arXiv — NLP / Computation & Language research 29d ago Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric… 19
arXiv — NLP / Computation & Language research 29d ago Consistency Training Can Entrench Misalignment arXiv:2606.03810v1 Announce Type: new Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures.
Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly… 31
arXiv — NLP / Computation & Language research 29d ago AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated… 16
arXiv — NLP / Computation & Language research 29d ago Quantifying Faithful Confidence Expression in Large Reasoning Models arXiv:2606.03969v1 Announce Type: new Abstract: Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This… 35
Hugging Face Daily Papers research 1mo ago Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures Abstract Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Learning… 17
Hugging Face Daily Papers research 1mo ago TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation Abstract A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research Agents have shown strong… 4
Hugging Face Daily Papers research 1mo ago Review Arcade: On the Human Alignment and Gameability of LLM Reviews Abstract Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM… 25
r/MachineLearning community 1mo ago Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R] Third in a series of papers tracking learning rules vs. human fMRI (THINGS dataset, V1–IT, N=3 subjects). Previous finding: untrained CNNs match backprop at V1. This paper asks: when does training break that, and does the learning rule matter? Setup: RSA alignment measured at 8… 30
Hugging Face Daily Papers research 1mo ago Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems Abstract Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation. AI-generated summary Physical AI systems increasingly map multimodal… 12
OpenAI official-blog 1mo ago Advancing youth safety and opportunity through global leadership OpenAI calls for global action on youth AI safety through a dedicated AI Safety Institute 4
arXiv — Machine Learning research 1mo ago ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use arXiv:2606.00341v1 Announce Type: new Abstract: As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount.
Although much work… 12
arXiv — Machine Learning research 1mo ago Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning arXiv:2606.00400v1 Announce Type: new Abstract: Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed… 11
arXiv — Machine Learning research 1mo ago MESA: Improving MoE Safety Alignment via Decentralized Expertise arXiv:2606.00651v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical… 36
arXiv — Machine Learning research 1mo ago Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing arXiv:2606.00686v1 Announce Type: new Abstract: The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach… 7
arXiv — NLP / Computation & Language research 1mo ago A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models arXiv:2606.00027v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice.
We developed a… 37
arXiv — NLP / Computation & Language research 1mo ago RealityTest: How People Probe AI Identity and Whether Models Disclose It arXiv:2606.00168v1 Announce Type: new Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of… 24
arXiv — NLP / Computation & Language research 1mo ago Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models arXiv:2606.00284v1 Announce Type: new Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around… 23
arXiv — NLP / Computation & Language research 1mo ago Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning arXiv:2606.00334v1 Announce Type: new Abstract: Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are… 21
arXiv — NLP / Computation & Language research 1mo ago Lost in Delusion: Examining LLM Safety Under User Delusions and Distress arXiv:2606.00975v1 Announce Type: new Abstract: LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates… 8
arXiv — NLP / Computation & Language research 1mo ago MENTIS: What Belief Changes Under Alignment?
Measuring Multi-Scale Latent Torsion in Language Models arXiv:2606.01060v1 Announce Type: new Abstract: Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and… 28
arXiv — NLP / Computation & Language research 1mo ago Low-Resource Safety Failures Are Action Failures, Not Representation Failures arXiv:2606.01196v1 Announce Type: new Abstract: Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive… 34
arXiv — NLP / Computation & Language research 1mo ago Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination arXiv:2606.01276v1 Announce Type: new Abstract: Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to… 33
arXiv — NLP / Computation & Language research 1mo ago TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven… 19
Hugging Face Daily Papers research 1mo ago Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents Abstract Model-aware skill alignment framework adapts skills to different backbones through hierarchical evolution and lightweight rewriter training, achieving superior performance across interactive tasks.
AI-generated summary LLM agents increasingly retrieve externally curated… 18
OpenAI official-blog 1mo ago Our views on AI policy and political advocacy Our approach to AI policy and political advocacy, transparency, support for thoughtful regulation and AI safety, and that no outside political group speaks on the company’s behalf. 26
The Information — AI news-outlet 1mo ago Florida Sues OpenAI and Sam Altman Over Safety Concerns Florida Attorney General James Uthmeier on Monday sued OpenAI and its chief executive Sam Altman, alleging 10 counts of negligence, liability, and other state law violations related to safety concerns over OpenAI’s consumer-facing tool ChatGPT. With the lawsuit, Florida became… 24
Hugging Face Daily Papers research 1mo ago The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement Abstract SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives. AI-generated summary Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and… 26
arXiv — Machine Learning research 1mo ago When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern,… 24
arXiv — Machine Learning research 1mo ago Calibrated Preference Learning: The Case of Label Ranking arXiv:2605.30447v1 Announce Type: new Abstract: Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making.
While extensively studied for classification and regression, calibration has not been formally… 20
arXiv — Machine Learning research 1mo ago Measuring, Localizing, and Ablating Alignment Signatures in LLMs arXiv:2605.30526v1 Announce Type: new Abstract: Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies… 11
arXiv — Machine Learning research 1mo ago Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules arXiv:2605.30556v1 Announce Type: new Abstract: Random, untrained neural networks consistently match or exceed trained networks in representational similarity to early visual cortex. This puzzling finding challenges the assumption that learning improves brain alignment. We… 22
arXiv — Machine Learning research 1mo ago Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation arXiv:2605.30585v1 Announce Type: new Abstract: Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. This paper investigates five major… 38
arXiv — Machine Learning research 1mo ago CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment arXiv:2605.30635v1 Announce Type: new Abstract: Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology.
In scRNA-sequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making… 31
arXiv — Machine Learning research 1mo ago CSULoRA: Closest Safe Update Low-Rank Adaptation arXiv:2605.30640v1 Announce Type: new Abstract: Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned… 28
arXiv — Machine Learning research 1mo ago Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences arXiv:2605.30873v1 Announce Type: new Abstract: Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user… 35