Tag

Safety + alignment

500 articles archived under #safety · RSS

arXiv — Machine Learning research 3d ago

Physics-Guided Robotic Radiation Source Localization along Arbitrary Measurement Paths in Unstructured Environments

arXiv:2606.27624v1 Announce Type: cross Abstract: Using robots to estimate the location of the radiation source is an effective way to improve efficiency and safety. Existing methods focus on planning the robot's path to achieve precise estimation, typically approaching the…

19
arXiv — NLP / Computation & Language research 3d ago

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

arXiv:2606.27632v1 Announce Type: new Abstract: As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from…

29
arXiv — NLP / Computation & Language research 3d ago

Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment

arXiv:2606.27731v1 Announce Type: new Abstract: Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as…

31
arXiv — NLP / Computation & Language research 3d ago

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn…

20
arXiv — NLP / Computation & Language research 3d ago

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering,…

31
r/LocalLLaMA community 4d ago

[NEW MODEL] - SupraSafety-18M · Tiny Content-Moderation Model

Hey r/LocalLLaMA! SupraLabs is back with a new model: SupraSafety-18M. It's a BERT-style 18M params model trained from scratch on 2 T4 GPUs in Kaggle on the nvidia/Nemotron-3.5-Content-Safety-Dataset dataset for 7 epochs. It's built to run on edge devices, mobile phones, or…

13
Hugging Face Daily Papers research 5d ago

LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

Abstract Score-based generative modeling reveals that side networks contribute likelihood scores to conditional control, leading to improved training efficiency through likelihood score alignment regularization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The prevalent…

36
OpenAI official-blog 5d ago

Previewing GPT-5.6 Sol: a next-generation model

OpenAI previews GPT-5.6 Sol, a next-generation model with stronger capabilities in coding, science, and cybersecurity, paired with its most advanced safety stack.

10
Smol AI News news-outlet 6d ago

not much happened today

**OpenAI** previewed **GPT-5.6** with three variants: **Sol** (flagship), **Terra** (mid-tier), and **Luna** (lower-cost), launching under a restricted rollout mandated by the U.S. government, limiting access to trusted partners. **Sol** boasts enhanced cybersecurity and safety…

35
arXiv — Machine Learning research 6d ago

Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's…

4
arXiv — Machine Learning research 6d ago

Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI

arXiv:2606.26406v1 Announce Type: new Abstract: We propose a complete architectural blueprint for safe artificial general intelligence based on a closed reentry loop (D I cycle). In contrast to feedforward networks, which are directed acyclic graphs (C=0, S=0) incapable of…

37
arXiv — NLP / Computation & Language research 6d ago

AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing

arXiv:2606.26787v1 Announce Type: cross Abstract: Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross…

26
arXiv — Machine Learning research 6d ago

RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage

arXiv:2606.27174v1 Announce Type: new Abstract: Medical device recalls are a critical regulatory mechanism for protecting patient safety. The growing volume of FDA recall records presents challenges in post-report recall triage, severity assessment, and root-cause…

24
arXiv — NLP / Computation & Language research 6d ago

The Geometry of Updates: Fisher Alignment at Vocabulary Scale

arXiv:2606.27242v1 Announce Type: cross Abstract: Training-free source selection for LLM families with shared vocabularies arises in scientific string domains such as SMILES, protein, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction…

38
arXiv — NLP / Computation & Language research 6d ago

Reducing Conversational Escalation in Large Language Model Dialogue with Nonviolent Communication Constraints

arXiv:2606.26106v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in emotionally charged situations involving interpersonal conflict, frustration, and distress. While prior safety research has focused on preventing explicit harms such as toxic or…

26
arXiv — NLP / Computation & Language research 6d ago

Soft Token Alignment for Cross-Lingual Reasoning

arXiv:2606.26466v1 Announce Type: new Abstract: Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively…

5
arXiv — NLP / Computation & Language research 6d ago

The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report

arXiv:2606.26529v1 Announce Type: new Abstract: AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified. We show that conditioning a language or vision model on a narrow task suppresses its…

14
arXiv — NLP / Computation & Language research 6d ago

GAVEL: Grounded Caption Error Verification and Localization

arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy…

24
arXiv — NLP / Computation & Language research 6d ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting…

35
arXiv — NLP / Computation & Language research 6d ago

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

arXiv:2606.27019v1 Announce Type: new Abstract: The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list…

7
arXiv — NLP / Computation & Language research 6d ago

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

arXiv:2606.27210v1 Announce Type: new Abstract: We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with…

17
arXiv — NLP / Computation & Language research 6d ago

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to…

19
arXiv — NLP / Computation & Language research 6d ago

Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against…

18
arXiv — NLP / Computation & Language research 6d ago

Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

arXiv:2606.26686v1 Announce Type: cross Abstract: In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However,…

17
arXiv — NLP / Computation & Language research 6d ago

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

arXiv:2606.26936v1 Announce Type: cross Abstract: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this…

36
TechCrunch — AI news-outlet 6d ago

The White House is asking OpenAI to slow roll the release of its new model over safety concerns

OpenAI reportedly plans to share its newest model, GPT 5.6, with a select group of partners instead of the broader public. The reason: the Trump administration told it to.

14
Hugging Face Daily Papers research 6d ago

Do Thinking Tokens Help with Safety?

Abstract Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation…

25
Hugging Face Daily Papers research 6d ago

PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Abstract Researchers develop a human-centered approach to align AI agents with privacy norms by creating a comprehensive dataset of privacy judgments and using annotation-conditioned reward modeling to improve agent behavior. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI…

7
Hugging Face Daily Papers research 6d ago

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Abstract Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Jailbreak attacks reveal…

23
Hugging Face Daily Papers research 7d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Abstract LLM agents frequently select higher-privilege tools unnecessarily, and while safety alignment doesn't ensure least-privilege choices, a post-training defense can reduce excessive privilege use without sacrificing performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

26
arXiv — NLP / Computation & Language research 7d ago

Digital Twin-Driven Adaptive Sim-to-Real Alignment via Reinforcement Learning for Vibration-Based Bearing Health Monitoring Under Data Scarcity

arXiv:2606.24954v1 Announce Type: cross Abstract: Vibration-based health monitoring of rotating machinery requires reliable fault diagnosis under operational data constraints, yet condition assessment remains challenged by structural scarcity of fault events and heterogeneous…

30
arXiv — Machine Learning research 7d ago

Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL

arXiv:2606.25012v1 Announce Type: new Abstract: Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization,…

27
arXiv — NLP / Computation & Language research 7d ago

Do Thinking Tokens Help with Safety?

arXiv:2606.25013v1 Announce Type: cross Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and…

37
arXiv — Machine Learning research 7d ago

Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion

arXiv:2606.25097v1 Announce Type: new Abstract: Speculative decoding accelerates inference by letting a draft model propose tokens for a target model to verify, raising a concrete safety question: at temperature zero, can draft-side behavior leak into safety-scored outputs? We…

7
arXiv — NLP / Computation & Language research 7d ago

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

arXiv:2606.25182v1 Announce Type: new Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it…

5
arXiv — NLP / Computation & Language research 7d ago

A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

arXiv:2606.25380v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts. This survey synthesizes work on toxicity detection and detoxification for…

38
arXiv — NLP / Computation & Language research 7d ago

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

arXiv:2606.25442v1 Announce Type: new Abstract: Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often…

29
arXiv — NLP / Computation & Language research 7d ago

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and…

36
arXiv — NLP / Computation & Language research 7d ago

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat…

23
arXiv — NLP / Computation & Language research 7d ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

arXiv:2606.25651v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety.…

34
arXiv — NLP / Computation & Language research 7d ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM…

18
arXiv — NLP / Computation & Language research 7d ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

arXiv:2606.25821v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which…

21
arXiv — NLP / Computation & Language research 7d ago

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

arXiv:2606.26015v1 Announce Type: new Abstract: Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low resource languages such as Tatar have…

10
arXiv — NLP / Computation & Language research 7d ago

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on…

23
arXiv — NLP / Computation & Language research 7d ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced…

24
arXiv — NLP / Computation & Language research 7d ago

RAS: Measuring LLM Safety Through Refusal Alignment

arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is…

27
arXiv — NLP / Computation & Language research 7d ago

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet…

14
Hugging Face Daily Papers research 8d ago

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation…

34
arXiv — Machine Learning research 8d ago

Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications

arXiv:2606.23858v1 Announce Type: new Abstract: A primary challenge in AI safety is the existence of adversarial examples -- slightly distorted inputs that cause a neural network (NN) to misclassify. To mitigate this problem, recent research focuses on the computation of…

12
arXiv — Machine Learning research 8d ago

ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation

arXiv:2606.23898v1 Announce Type: new Abstract: Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. Unlike recognition tasks, knowledge distillation in conditional…

14
