News / #safety Tag Safety + alignment 500 articles archived under #safety · RSS Sign in to follow arXiv — Machine Learning research 3d ago Physics-Guided Robotic Radiation Source Localization along Arbitrary Measurement Paths in Unstructured Environments arXiv:2606.27624v1 Announce Type: cross Abstract: Using robots to estimate the location of the radiation source is an effective way to improve efficiency and safety. Existing methods focus on planning the robot's path to achieve precise estimation, typically approaching the… 19 arXiv — NLP / Computation & Language research 3d ago Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety arXiv:2606.27632v1 Announce Type: new Abstract: As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from… 29 arXiv — NLP / Computation & Language research 3d ago Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment arXiv:2606.27731v1 Announce Type: new Abstract: Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as… 31 arXiv — NLP / Computation & Language research 3d ago Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn… 20 arXiv — NLP / Computation & Language research 3d ago SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering,… 31 r/LocalLLaMA community 4d ago [NEW MODEL] - SupraSafety-18M · Tiny Content-Moderation Model Hey r/LocalLLaMA ! SupraLabs is back with a new model: SupraSafety-18M . It's a BERT-style 18M params model trained from scratch on 2 T4 GPUs in Kaggle on the nvidia/Nemotron-3.5-Content-Safety-Dataset dataset for 7 epochs. It's built to run on edge devices , mobile phones , or… 13 Hugging Face Daily Papers research 5d ago LISA: Likelihood Score Alignment for Visual-condition Controllable Generation Abstract Score-based generative modeling reveals that side networks contribute likelihood scores to conditional control, leading to improved training efficiency through likelihood score alignment regularization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The prevalent… 36 OpenAI official-blog 5d ago Previewing GPT-5.6 Sol: a next-generation model OpenAI previews GPT-5.6 Sol, a next-generation model with stronger capabilities in coding, science, and cybersecurity, paired with its most advanced safety stack. 10 Smol AI News news-outlet 6d ago not much happened today **OpenAI** previewed **GPT-5.6** with three variants: **Sol** (flagship), **Terra** (mid-tier), and **Luna** (lower-cost), launching under a restricted rollout mandated by the U.S. government, limiting access to trusted partners. **Sol** boasts enhanced cybersecurity and safety… 35 arXiv — Machine Learning research 6d ago Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's… 4 arXiv — Machine Learning research 6d ago Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI arXiv:2606.26406v1 Announce Type: new Abstract: We propose a complete architectural blueprint for safe artificial general intelligence based on a closed reentry loop (D I cycle). In contrast to feedforward networks, which are directed acyclic graphs (C=0, S=0) incapable of… 37 arXiv — NLP / Computation & Language research 6d ago AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing arXiv:2606.26787v1 Announce Type: cross Abstract: Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross… 26 arXiv — Machine Learning research 6d ago RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage arXiv:2606.27174v1 Announce Type: new Abstract: Medical device recalls are a critical regulatory mechanism for protecting patient safety. The growing volume of FDA recall records presents challenges in post-report recall triage, severity assessment, and root-cause… 24 arXiv — NLP / Computation & Language research 6d ago The Geometry of Updates: Fisher Alignment at Vocabulary Scale arXiv:2606.27242v1 Announce Type: cross Abstract: Training-free source selection for LLM families with shared vocabularies arises in scientific string domains such as SMILES, protein, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction… 38 arXiv — NLP / Computation & Language research 6d ago Reducing Conversational Escalation in Large Language Model Dialogue with Nonviolent Communication Constraints arXiv:2606.26106v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in emotionally charged situations involving interpersonal conflict, frustration, and distress. While prior safety research has focused on preventing explicit harms such as toxic or… 26 arXiv — NLP / Computation & Language research 6d ago Soft Token Alignment for Cross-Lingual Reasoning arXiv:2606.26466v1 Announce Type: new Abstract: Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively… 5 arXiv — NLP / Computation & Language research 6d ago The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report arXiv:2606.26529v1 Announce Type: new Abstract: AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified. We show that conditioning a language or vision model on a narrow task suppresses its… 14 arXiv — NLP / Computation & Language research 6d ago GAVEL: Grounded Caption Error Verification and Localization arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy… 24 arXiv — NLP / Computation & Language research 6d ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting… 35 arXiv — NLP / Computation & Language research 6d ago MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment arXiv:2606.27019v1 Announce Type: new Abstract: The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list… 7 arXiv — NLP / Computation & Language research 6d ago Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes arXiv:2606.27210v1 Announce Type: new Abstract: We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with… 17 arXiv — NLP / Computation & Language research 6d ago Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to… 19 arXiv — NLP / Computation & Language research 6d ago Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against… 18 arXiv — NLP / Computation & Language research 6d ago Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation arXiv:2606.26686v1 Announce Type: cross Abstract: In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However,… 17 arXiv — NLP / Computation & Language research 6d ago Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries arXiv:2606.26936v1 Announce Type: cross Abstract: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this… 36 TechCrunch — AI news-outlet 6d ago The White House is asking OpenAI to slow roll the release of its new model over safety concerns penAI reportedly plans to share its newest model, GPT 5.6, with a select group of partners instead of to the broader public. The reason: the Trump administration told it to. 14 Hugging Face Daily Papers research 6d ago Do Thinking Tokens Help with Safety? Abstract Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation… 25 Hugging Face Daily Papers research 6d ago PrivacyAlign: Contextual Privacy Alignment for LLM Agents Abstract Researchers develop a human-centered approach to align AI agents with privacy norms by creating a comprehensive dataset of privacy judgments and using annotation-conditioned reward modeling to improve agent behavior. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI… 7 Hugging Face Daily Papers research 6d ago What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics Abstract Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Jailbreak attacks reveal… 23 Hugging Face Daily Papers research 7d ago When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents Abstract LLM agents frequently select higher-privilege tools unnecessarily, and while safety alignment doesn't ensure least-privilege choices, a post-training defense can reduce excessive privilege use without sacrificing performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 26 arXiv — NLP / Computation & Language research 7d ago Digital Twin-Driven Adaptive Sim-to-Real Alignment via Reinforcement Learning for Vibration-Based Bearing Health Monitoring Under Data Scarcity arXiv:2606.24954v1 Announce Type: cross Abstract: Vibration-based health monitoring of rotating machinery requires reliable fault diagnosis under operational data constraints, yet condition assessment remains challenged by structural scarcity of fault events and heterogeneous… 30 arXiv — Machine Learning research 7d ago Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL arXiv:2606.25012v1 Announce Type: new Abstract: Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization,… 27 arXiv — NLP / Computation & Language research 7d ago Do Thinking Tokens Help with Safety? arXiv:2606.25013v1 Announce Type: cross Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and… 37 arXiv — Machine Learning research 7d ago Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion arXiv:2606.25097v1 Announce Type: new Abstract: Speculative decoding accelerates inference by letting a draft model propose tokens for a target model to verify, raising a concrete safety question: at temperature zero, can draft-side behavior leak into safety-scored outputs? We… 7 arXiv — NLP / Computation & Language research 7d ago What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics arXiv:2606.25182v1 Announce Type: new Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it… 5 arXiv — NLP / Computation & Language research 7d ago A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models arXiv:2606.25380v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts. This survey synthesizes work on toxicity detection and detoxification for… 38 arXiv — NLP / Computation & Language research 7d ago PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models arXiv:2606.25442v1 Announce Type: new Abstract: Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often… 29 arXiv — NLP / Computation & Language research 7d ago A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and… 36 arXiv — NLP / Computation & Language research 7d ago How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat… 23 arXiv — NLP / Computation & Language research 7d ago MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction arXiv:2606.25651v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety.… 34 arXiv — NLP / Computation & Language research 7d ago Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM… 18 arXiv — NLP / Computation & Language research 7d ago SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment arXiv:2606.25821v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which… 21 arXiv — NLP / Computation & Language research 7d ago The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar arXiv:2606.26015v1 Announce Type: new Abstract: Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low resource languages such as Tatar have… 10 arXiv — NLP / Computation & Language research 7d ago Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs? arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on… 23 arXiv — NLP / Computation & Language research 7d ago Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced… 24 arXiv — NLP / Computation & Language research 7d ago RAS: Measuring LLM Safety Through Refusal Alignment arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is… 27 arXiv — NLP / Computation & Language research 7d ago Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet… 14 Hugging Face Daily Papers research 8d ago FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation… 34 arXiv — Machine Learning research 8d ago Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications arXiv:2606.23858v1 Announce Type: new Abstract: A primary challenge in AI safety is the existence of adversarial examples -- slightly distorted inputs that cause a neural network (NN) to misclassify. To mitigate this problem, recent research focuses on the computation of… 12 arXiv — Machine Learning research 8d ago ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation arXiv:2606.23898v1 Announce Type: new Abstract: Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. Unlike recognition tasks, knowledge distillation in conditional… 14 Page 2 of 10 · 500 articles ← Newer Older →