Safety + alignment: 500 articles archived under #safety
arXiv — Machine Learning research 1mo ago Parallel Tempering Initial Sampling in Inference-Time Reward Alignment arXiv:2605.30991v1 Announce Type: new Abstract: Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this… 14 arXiv — NLP / Computation & Language research 1mo ago Configurable Reward Model for Balanced Safety Alignment arXiv:2605.30487v1 Announce Type: new Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety… 11 arXiv — NLP / Computation & Language research 1mo ago Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty arXiv:2605.30675v1 Announce Type: new Abstract: Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the… 5 arXiv — NLP / Computation & Language research 1mo ago Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents arXiv:2605.30723v1 Announce Type: new Abstract: LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks.
Existing skill libraries are typically treated as… 15 arXiv — NLP / Computation & Language research 1mo ago Pairwise Reference Alignment as a Model-Level Ordinal Observable arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference… 18 arXiv — NLP / Computation & Language research 1mo ago The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement arXiv:2605.30888v1 Announce Type: new Abstract: Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the… 31 arXiv — NLP / Computation & Language research 1mo ago ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails arXiv:2605.31073v1 Announce Type: new Abstract: Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent… 16 arXiv — NLP / Computation & Language research 1mo ago Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. 
While EM has been extensively studied in the supervised fine-tuning (SFT)… 20 arXiv — NLP / Computation & Language research 1mo ago LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues… 36 r/LocalLLaMA community 1mo ago 13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. coder3101's variant achieves 96% ASR with capability… 17 Hugging Face Daily Papers research 1mo ago Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases Abstract Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling. AI-generated… 17 arXiv — Machine Learning research 1mo ago Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents arXiv:2605.28850v1 Announce Type: new Abstract: We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. 
Using TradeArena, an auditable trading-agent testbed with risk reports, execution simulation,… 26 arXiv — Machine Learning research 1mo ago Representation Alignment Rests on Linear Structure arXiv:2605.28870v1 Announce Type: new Abstract: We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal… 11 arXiv — Machine Learning research 1mo ago A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio arXiv:2605.28975v1 Announce Type: new Abstract: We study the log-alignment ratio (LAR), a measure of parameter-activation alignment, introduced in parameterization theory. We reformulate it as the overlap between a weight spectrum $p$ of the normalized squared singular values of… 37 arXiv — Machine Learning research 1mo ago Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning arXiv:2605.29028v1 Announce Type: new Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their… 17 arXiv — Machine Learning research 1mo ago PROTOCOL: Late Interaction Retrieval for Protein Homolog Search arXiv:2605.29158v1 Announce Type: new Abstract: Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose… 12 arXiv — Machine Learning research 1mo ago SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction arXiv:2605.29236v1 Announce Type: new Abstract: Alarm fatigue in intensive care units (ICUs) is a well documented patient safety crisis. 
Clinical monitors generate 350 or more alarms per patient per day, out of which 72-99% are clinically irrelevant. Staff desensitization to… 29 arXiv — Machine Learning research 1mo ago Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content arXiv:2605.29659v1 Announce Type: new Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail… 4 arXiv — NLP / Computation & Language research 1mo ago From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale arXiv:2605.28826v1 Announce Type: new Abstract: In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. Language models trained with contemporary pipelines exhibit severe reshaping… 35 arXiv — NLP / Computation & Language research 1mo ago Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated… 19 arXiv — NLP / Computation & Language research 1mo ago GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. 
Static bias benchmarks remain useful, but they do not show how… 31 arXiv — NLP / Computation & Language research 1mo ago Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents arXiv:2605.29224v1 Announce Type: new Abstract: AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment… 15 arXiv — NLP / Computation & Language research 1mo ago A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods… 19 arXiv — NLP / Computation & Language research 1mo ago Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset arXiv:2605.29365v1 Announce Type: new Abstract: Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human… 20 arXiv — NLP / Computation & Language research 1mo ago Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning arXiv:2605.29414v1 Announce Type: new Abstract: Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). 
However,… 32 arXiv — NLP / Computation & Language research 1mo ago Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment arXiv:2605.29458v1 Announce Type: new Abstract: Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and… 36 arXiv — NLP / Computation & Language research 1mo ago Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries,… 9 arXiv — NLP / Computation & Language research 1mo ago Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs arXiv:2605.29708v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled… 36 Hugging Face Daily Papers research 1mo ago LoMo: Local Modality Substitution for Deeper Vision-Language Fusion Abstract Vision-language models suffer from modality sensitivity due to training data bias, but a new data curation approach called Local Modality Substitution improves cross-modal representation alignment and reasoning performance. 
AI-generated summary Vision-Language Models… 26 Hugging Face Daily Papers research 1mo ago AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security Abstract A lightweight and scalable agent safety alignment framework is proposed to address emerging threats from advanced AI models, featuring taxonomy-guided training with minimal samples and efficient deployment in real-world scenarios. AI-generated summary Modern open-world… 23 Hugging Face Daily Papers research 1mo ago Native Audio-Visual Alignment for Generation Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and… 38 The Information — AI news-outlet 1mo ago Illinois Legislature Passes Landmark AI Safety Bill On Wednesday, the Illinois House of Representatives passed a bill that will require major AI companies to submit their model safety plans for third-party audits, as well as creating whistleblower protections for those companies’ employees. While Governor JB Pritzker still has to… 9 Ars Technica — AI news-outlet 1mo ago Trump loses more control over AI regulation as Illinois passes landmark law Here’s why Anthropic and OpenAI are on board with Illinois safety testing. 9 arXiv — Machine Learning research 1mo ago Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment arXiv:2605.27659v1 Announce Type: new Abstract: Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. 
However, when deployed in real world… 15 arXiv — Machine Learning research 1mo ago High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention arXiv:2605.27758v1 Announce Type: new Abstract: Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While… 18 arXiv — Machine Learning research 1mo ago A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized… 17 arXiv — Machine Learning research 1mo ago FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation arXiv:2605.27892v1 Announce Type: new Abstract: Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. 
However, most existing EHR generative models are… 32 arXiv — Machine Learning research 1mo ago AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels arXiv:2605.28021v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on… 25 arXiv — Machine Learning research 1mo ago SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection arXiv:2605.28030v1 Announce Type: new Abstract: Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a… 22 arXiv — NLP / Computation & Language research 1mo ago ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment arXiv:2605.27374v1 Announce Type: new Abstract: Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical… 26 arXiv — NLP / Computation & Language research 1mo ago Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models arXiv:2605.27383v1 Announce Type: new Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. 
However, their effectiveness in low-resource languages remains fundamentally limited by… 24 arXiv — NLP / Computation & Language research 1mo ago Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities arXiv:2605.27388v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical… 16 arXiv — NLP / Computation & Language research 1mo ago PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI arXiv:2605.27545v1 Announce Type: new Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple… 38 arXiv — NLP / Computation & Language research 1mo ago TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling arXiv:2605.27690v1 Announce Type: new Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore… 22 arXiv — NLP / Computation & Language research 1mo ago The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages arXiv:2605.27901v1 Announce Type: new Abstract: Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. 
However, its reliability remains largely unexplored beyond English and across diverse… 30 arXiv — NLP / Computation & Language research 1mo ago KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations:… 18 arXiv — NLP / Computation & Language research 1mo ago Chinese Word Boundary Recovery through Character Alignment Projection arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper… 30 arXiv — NLP / Computation & Language research 1mo ago Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment arXiv:2605.28188v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but… 23 arXiv — NLP / Computation & Language research 1mo ago CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models arXiv:2605.28292v1 Announce Type: new Abstract: Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. 
However, existing approaches typically lack alignment with explicit rationales and adaptivity to example… 31 arXiv — NLP / Computation & Language research 1mo ago HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment arXiv:2605.28308v1 Announce Type: new Abstract: Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject… 11