Tag

Safety + alignment

500 articles archived under #safety

arXiv — Machine Learning research 8d ago

Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment

arXiv:2606.24851v1 Announce Type: new Abstract: Fourier Neural Operators (FNO) learn solution operators of partial differential equations by parameterizing global convolutions in the complex Fourier domain. For real-valued PDE solutions, the complex FFT carries representational…

20
arXiv — NLP / Computation & Language research 8d ago

Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

arXiv:2606.23700v1 Announce Type: new Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model's aligned character rather than direct learning of…

8
arXiv — Machine Learning research 8d ago

Verifiable Foundation Models for Robot Safety

arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable…

4
arXiv — Machine Learning research 8d ago

EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics

arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate…

19
arXiv — Machine Learning research 8d ago

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target…

29
arXiv — NLP / Computation & Language research 8d ago

One Year Later...The Harms Persist, But So Do We!

arXiv:2606.23884v1 Announce Type: new Abstract: General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary…

26
arXiv — NLP / Computation & Language research 8d ago

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

arXiv:2606.24004v1 Announce Type: new Abstract: Steering a large language model (LLM) toward a desired behavior typically relies on an iterative process of hand-crafting a prompt based on a careful inspection of the model's responses. This is an involved, brittle, and…

28
arXiv — NLP / Computation & Language research 8d ago

Selective Capability Unlearning in End-to-End Spoken Language Understanding

arXiv:2606.24063v1 Announce Type: new Abstract: Modern spoken language understanding (SLU) systems are increasingly deployed in real-world settings, where specific functionalities may need to be removed due to policy or safety constraints. In SLU, a functionality corresponds to…

23
arXiv — NLP / Computation & Language research 8d ago

Less is More: Quality-Aware Training Data Selection for Scientific Summarization

arXiv:2606.24828v1 Announce Type: new Abstract: Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific…

38
arXiv — NLP / Computation & Language research 8d ago

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

arXiv:2606.23885v1 Announce Type: cross Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing…

17
arXiv — NLP / Computation & Language research 8d ago

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

arXiv:2606.24014v1 Announce Type: cross Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL),…

5
arXiv — NLP / Computation & Language research 8d ago

Progressive Alignment Objectives for Aligner-Encoder based ASR

arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without…

23
arXiv — NLP / Computation & Language research 8d ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline…

25
r/LocalLLaMA community 8d ago

I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.

I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8…

10
OpenAI official-blog 8d ago

Helping build shared standards for advanced AI

OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation.

31
Hugging Face Daily Papers research 9d ago

SkillHarness: Harnessing Safe Skills for Computer-Use Agents

Abstract: SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Computer-Use Agents (CUAs)…

24
Hugging Face Daily Papers research 9d ago

Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Abstract: Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal…

34
Hugging Face Daily Papers research 9d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract: Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Massive unstructured…

19
Hugging Face Daily Papers research 9d ago

Safe Few-Step Generation via Velocity Editing

Abstract: VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Flow matching has recently emerged as a strong…

16
Hugging Face Daily Papers research 9d ago

Exploring the Design Space of Reward Backpropagation for Flow Matching

Abstract: FlowBP addresses limitations in flow matching model alignment by using a surrogate trajectory framework that reduces memory usage and gradient chaining while maintaining performance across multiple text-to-image models. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

23
Latent.Space news-outlet 9d ago

Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan

OpenAI board member Zico Kolter and Gray Swan CEO Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI.”

22
NVIDIA Developer Blog official-blog 9d ago

Inside NVIDIA Halos for Robotics: A Full-Stack Functional Safety System for Physical AI

Physical AI—robots working autonomously alongside people in factories, warehouses, hospitals, and homes—is arriving faster than most expected. Traditional...

12
r/LocalLLaMA community 10d ago

Qwen 3.6 27b Abliterated (apostate)

I've been working on a project called Apostate and have finally released my first large model with it on Hugging Face. Qwen 3.6 27B with safety alignment removed: refusal rate down from 92% to 7.6%, with minimal impact on the model's capabilities (0.120 KL). Qwen 3.6 27B Apostate…

17
Don't Worry About the Vase community 12d ago

Claude Fable 5 and Mythos 5: Capabilities

Only three days after the release of Claude Fable 5, Anthropic was forced by the United States Government to make it unavailable, when a jailbreak was brought to its attention, rather than the previous situation of ‘yes obviously experts can jailbreak anything if they care…

32
Hugging Face Daily Papers research 13d ago

Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Abstract: Hybrid linear attention models can be improved through a novel initialization technique that enhances conversion from pretrained Transformers by leveraging teacher attention statistics and alignment steps. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Hybrid linear…

6
Hugging Face Daily Papers research 13d ago

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

Abstract: FlowBender is a closed-loop framework that addresses constraint satisfaction in diffusion and flow models by training networks to correct alignment errors using inference-time feedback, outperforming traditional supervised and guidance-based approaches across multiple…

11
arXiv — Machine Learning research 13d ago

When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting

arXiv:2606.19363v1 Announce Type: new Abstract: The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when…

32
arXiv — Machine Learning research 13d ago

Tracking Representation Dynamics in Large Language Models with Persistent Homology

arXiv:2606.19542v1 Announce Type: new Abstract: Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking…

38
arXiv — Machine Learning research 13d ago

On the QUEST for Uncertainty Quantification via Highest Density Regions

arXiv:2606.19569v1 Announce Type: new Abstract: Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches - notably, those based on proper…

23
arXiv — Machine Learning research 13d ago

Shifting-based Optimizable Linear Relaxations for General Activation Functions

arXiv:2606.20292v1 Announce Type: new Abstract: The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of…

10
arXiv — NLP / Computation & Language research 13d ago

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

arXiv:2606.19346v1 Announce Type: new Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and…

6
arXiv — NLP / Computation & Language research 13d ago

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

arXiv:2606.19864v1 Announce Type: new Abstract: The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns…

12
arXiv — NLP / Computation & Language research 13d ago

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

arXiv:2606.20225v1 Announce Type: new Abstract: Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared…

31
arXiv — NLP / Computation & Language research 13d ago

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

arXiv:2606.20482v1 Announce Type: new Abstract: To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations.…

7
arXiv — NLP / Computation & Language research 13d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving…

17
arXiv — NLP / Computation & Language research 13d ago

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

arXiv:2606.20205v1 Announce Type: cross Abstract: Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in…

34
arXiv — NLP / Computation & Language research 13d ago

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

arXiv:2512.03818v2 Announce Type: replace Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording…

33
arXiv — NLP / Computation & Language research 13d ago

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual…

7
r/MachineLearning community 13d ago

Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R]

I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified…

29
Hugging Face Daily Papers research 13d ago

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Abstract: Discriminator-Guided Reinforcement Learning (DRL) addresses alignment issues in score- and flow-matching models by using a pretrained representation space discriminator as an optimal reward signal, improving both visual fidelity and semantic quality without human…

4
r/MachineLearning community 13d ago

HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D]

TL;DR for ML Specialists: The Core: An empirical study on how long, semantically dense, completely benign text (with zero triggers, instructions, or jailbreak prompts) drives an implicit shift in the model's latent space trajectories. The Effect: Dilution of the initial system…

24
Hugging Face Daily Papers research 14d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Abstract: Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multicultural multi-agent systems…

28
arXiv — Machine Learning research 14d ago

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

arXiv:2606.18308v1 Announce Type: new Abstract: Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these…

35
arXiv — Machine Learning research 14d ago

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

arXiv:2606.18703v1 Announce Type: new Abstract: Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet…

17
arXiv — Machine Learning research 14d ago

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target…

14
arXiv — NLP / Computation & Language research 14d ago

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded…

5
arXiv — NLP / Computation & Language research 14d ago

Steerable Cultural Preference Optimization of Reward Models

arXiv:2606.18606v1 Announce Type: new Abstract: It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on…

16
arXiv — NLP / Computation & Language research 14d ago

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

arXiv:2606.18656v1 Announce Type: new Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper…

22
arXiv — NLP / Computation & Language research 14d ago

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series…

10
arXiv — NLP / Computation & Language research 14d ago

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is…

6
