Tag: Safety + alignment · 500 articles archived under #safety
arXiv — Machine Learning research 8d ago Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment arXiv:2606.24851v1 Announce Type: new Abstract: Fourier Neural Operators (FNO) learn solution operators of partial differential equations by parameterizing global convolutions in the complex Fourier domain. For real-valued PDE solutions, the complex FFT carries representational… 20 arXiv — NLP / Computation & Language research 8d ago Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment arXiv:2606.23700v1 Announce Type: new Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model's aligned character rather than direct learning of… 8 arXiv — Machine Learning research 8d ago Verifiable Foundation Models for Robot Safety arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable… 4 arXiv — Machine Learning research 8d ago EERLoss: A Novel Loss Function for Training Deep Biometric Models. 
A Case Study in Keystroke Dynamics arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate… 19 arXiv — Machine Learning research 8d ago ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target… 29 arXiv — NLP / Computation & Language research 8d ago One Year Later...The Harms Persist, But So Do We! arXiv:2606.23884v1 Announce Type: new Abstract: General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary… 26 arXiv — NLP / Computation & Language research 8d ago Towards Spec Learning: Inference-Time Alignment from Preference Pairs arXiv:2606.24004v1 Announce Type: new Abstract: Steering a large language model (LLM) toward a desired behavior typically relies on an iterative process of hand-crafting a prompt based on a careful inspection of the model's responses. This is an involved, brittle, and… 28 arXiv — NLP / Computation & Language research 8d ago Selective Capability Unlearning in End-to-End Spoken Language Understanding arXiv:2606.24063v1 Announce Type: new Abstract: Modern spoken language understanding (SLU) systems are increasingly deployed in real-world settings, where specific functionalities may need to be removed due to policy or safety constraints. 
In SLU, a functionality corresponds to… 23 arXiv — NLP / Computation & Language research 8d ago Less is More: Quality-Aware Training Data Selection for Scientific Summarization arXiv:2606.24828v1 Announce Type: new Abstract: Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific… 38 arXiv — NLP / Computation & Language research 8d ago Mind the Heads: Topological Representation Alignment for Multimodal LLMs arXiv:2606.23885v1 Announce Type: cross Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing… 17 arXiv — NLP / Computation & Language research 8d ago Reinforcement Learning Towards Broadly and Persistently Beneficial Models arXiv:2606.24014v1 Announce Type: cross Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. 
This is especially important for reinforcement learning (RL),… 5 arXiv — NLP / Computation & Language research 8d ago Progressive Alignment Objectives for Aligner-Encoder based ASR arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without… 23 arXiv — NLP / Computation & Language research 8d ago AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline… 25 r/LocalLLaMA community 8d ago I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention. I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8… 10 OpenAI official-blog 8d ago Helping build shared standards for advanced AI OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation. 31 Hugging Face Daily Papers research 9d ago SkillHarness: Harnessing Safe Skills for Computer-Use Agents Abstract SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-Use Agents (CUAs)… 24 Hugging Face Daily Papers research 9d ago Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding Abstract Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal… 34 Hugging Face Daily Papers research 9d ago DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured… 19 Hugging Face Daily Papers research 9d ago Safe Few-Step Generation via Velocity Editing Abstract VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Flow matching has recently emerged as a strong… 16 Hugging Face Daily Papers research 9d ago Exploring the Design Space of Reward Backpropagation for Flow Matching Abstract FlowBP addresses limitations in flow matching model alignment by using a surrogate trajectory framework that reduces memory usage and gradient chaining while maintaining performance across multiple text-to-image models. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 23 Latent.Space news-outlet 9d ago Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan OpenAI boardmember Zico Kolter and Gray Swan CEO Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI” 22 NVIDIA Developer Blog official-blog 9d ago Inside NVIDIA Halos for Robotics: A Full-Stack Functional Safety System for Physical AI Physical AI—robots working autonomously alongside people in factories, warehouses, hospitals, and homes—is arriving faster than most expected. Traditional... 12 r/LocalLLaMA community 10d ago Qwen 3.6 27b Abliterated (apostate) I've been working on a project called Apostate and have finally released my first large model with it on Hugging Face. Qwen 3.6 27B with safety alignment removed down from 92% to 7.6% refusal rate with minimal impact on the model's capabilities (0.120 KL). Qwen 3.6 27B Apostate… 17 Don't Worry About the Vase community 12d ago Claude Fable 5 and Mythos 5: Capabilities Only three days after the release of Claude Fable 5, Anthropic was forced by the United States Government to make it unavailable, when a jailbreak was brought to its attention, rather than the previous situation of ‘yes obviously experts can jailbreak anything if they care… 32 Hugging Face Daily Papers research 13d ago Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation Abstract Hybrid linear attention models can be improved through a novel initialization technique that enhances conversion from pretrained Transformers by leveraging teacher attention statistics and alignment steps. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Hybrid linear… 6 Hugging Face Daily Papers research 13d ago FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows Abstract FlowBender is a closed-loop framework that addresses constraint satisfaction in diffusion and flow models by training networks to correct alignment errors using inference-time feedback, outperforming traditional supervised and guidance-based approaches across multiple… 11 arXiv — Machine Learning research 13d ago When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting arXiv:2606.19363v1 Announce Type: new Abstract: The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when… 32 arXiv — Machine Learning research 13d ago Tracking Representation Dynamics in Large Language Models with Persistent Homology arXiv:2606.19542v1 Announce Type: new Abstract: Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking… 38 arXiv — Machine Learning research 13d ago On the QUEST for Uncertainty Quantification via Highest Density Regions arXiv:2606.19569v1 Announce Type: new Abstract: Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches - notably, those based on proper… 23 arXiv — Machine Learning research 13d ago Shifting-based Optimizable Linear Relaxations for General Activation Functions arXiv:2606.20292v1 Announce Type: new Abstract: The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. 
To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of… 10 arXiv — NLP / Computation & Language research 13d ago Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer arXiv:2606.19346v1 Announce Type: new Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and… 6 arXiv — NLP / Computation & Language research 13d ago The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI arXiv:2606.19864v1 Announce Type: new Abstract: The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns… 12 arXiv — NLP / Computation & Language research 13d ago Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families arXiv:2606.20225v1 Announce Type: new Abstract: Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared… 31 arXiv — NLP / Computation & Language research 13d ago Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users arXiv:2606.20482v1 Announce Type: new Abstract: To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. 
These existing methods have two key limitations.… 7 arXiv — NLP / Computation & Language research 13d ago When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving… 17 arXiv — NLP / Computation & Language research 13d ago Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact arXiv:2606.20205v1 Announce Type: cross Abstract: Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in… 34 arXiv — NLP / Computation & Language research 13d ago Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology arXiv:2512.03818v2 Announce Type: replace Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording… 33 arXiv — NLP / Computation & Language research 13d ago Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. 
We introduce OmniSONAR, a new family of omnilingual, cross-lingual… 7 r/MachineLearning community 13d ago Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R] I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified… 29 Hugging Face Daily Papers research 13d ago The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL Abstract Discriminator-Guided Reinforcement Learning (DRL) addresses alignment issues in score- and flow-matching models by using a pretrained representation space discriminator as an optimal reward signal, improving both visual fidelity and semantic quality without human… 4 r/MachineLearning community 13d ago HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D] TL;DR for ML Specialists: The Core: An empirical study on how long, semantically dense, completely benign text (with zero triggers, instructions, or jailbreak prompts) drives an implicit shift in the model's latent space trajectories. The Effect: Dilution of the initial system… 24 Hugging Face Daily Papers research 14d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems Abstract Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multicultural multi-agent systems… 28 arXiv — Machine Learning research 14d ago TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning arXiv:2606.18308v1 Announce Type: new Abstract: Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these… 35 arXiv — Machine Learning research 14d ago Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment arXiv:2606.18703v1 Announce Type: new Abstract: Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet… 17 arXiv — Machine Learning research 14d ago Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target… 14 arXiv — NLP / Computation & Language research 14d ago Montreal Forced Aligner and the state of speech-to-text alignment in 2026 arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. 
In the decade since, MFA has undergone substantial development, including expanded… 5 arXiv — NLP / Computation & Language research 14d ago Steerable Cultural Preference Optimization of Reward Models arXiv:2606.18606v1 Announce Type: new Abstract: It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on… 16 arXiv — NLP / Computation & Language research 14d ago The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs arXiv:2606.18656v1 Announce Type: new Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper… 22 arXiv — NLP / Computation & Language research 14d ago Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series… 10 arXiv — NLP / Computation & Language research 14d ago G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is… 6