Safety + alignment
500 articles archived under #safety

Interconnects (Nathan Lambert) · research · 22d ago
Claude Fable 5 and new AI safety fables
One step further into the power politics of frontier AI systems.
6

Hugging Face Daily Papers · research · 22d ago
Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense
Abstract: SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct
Prompt-injection detectors are…
30

Hugging Face Daily Papers · research · 23d ago
Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents
Abstract: Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather…
24

arXiv — Machine Learning · research · 23d ago
Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning
arXiv:2606.07631v1 · Announce Type: new
Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated…
29

arXiv — Machine Learning · research · 23d ago
DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
arXiv:2606.07678v1 · Announce Type: new
Abstract: Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets.
Existing data selection methods typically score each preference pair independently, collapsing…
12

arXiv — Machine Learning · research · 23d ago
Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head
arXiv:2606.07694v1 · Announce Type: new
Abstract: Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging.…
6

arXiv — Machine Learning · research · 23d ago
Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories
arXiv:2606.07889v1 · Announce Type: new
Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change…
31

arXiv — Machine Learning · research · 23d ago
Enhancing AI Interpretability and Safety through Localised Architectures
arXiv:2606.07998v1 · Announce Type: new
Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models.
The…
8

arXiv — Machine Learning · research · 23d ago
When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
arXiv:2606.08044v1 · Announce Type: new
Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under…
33

Hacker News — AI on Front Page · community · 23d ago
Surveillance Is Not Safety: A statement on the UK's latest threat to privacy [pdf]
Article URL: https://signal.org/blog/pdfs/2026-06-08-uk-surveillance-is-not-safety.pdf
Comments URL: https://news.ycombinator.com/item?id=48450646
Points: 274 · # Comments: 70
8

Hugging Face · official-blog · 24d ago
Building Pakistan Notice Helper: A Small AI Tool for a Very Local Safety Problem
Team Article · Published June 8, 2026
Abid Ali Awan kingabzpro build-small-hackathon
For the Hugging Face Build Small Hackathon, I wanted to build something practical,…
35

arXiv — Machine Learning · research · 24d ago
Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy
arXiv:2606.06554v1 · Announce Type: new
Abstract: Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz Dual-Comb…
25

arXiv — Machine Learning · research · 24d ago
GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution
arXiv:2606.06892v1 · Announce Type: new
Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples.
This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and…
4

arXiv — Machine Learning · research · 24d ago
Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making
arXiv:2606.07088v1 · Announce Type: new
Abstract: Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly…
21

arXiv — NLP / Computation & Language · research · 24d ago
The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
arXiv:2606.06667v1 · Announce Type: new
Abstract: The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated…
15

arXiv — NLP / Computation & Language · research · 24d ago
Korean Culture into LLM Alignment: Toward Cultural Coherence
arXiv:2606.06797v1 · Announce Type: new
Abstract: Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is…
15

arXiv — NLP / Computation & Language · research · 24d ago
Sycophantic Praise: Evaluating Excessive Praise in Language Models
arXiv:2606.07441v1 · Announce Type: new
Abstract: Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention.
We argue that sycophantic praise is a distinct alignment…
26

arXiv — NLP / Computation & Language · research · 24d ago
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
arXiv:2606.07309v1 · Announce Type: cross
Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question…
14

arXiv — NLP / Computation & Language · research · 24d ago
TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
arXiv:2606.07451v1 · Announce Type: cross
Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance.…
6

Hugging Face Daily Papers · research · 24d ago
UniSHARP: Universal Sharp Monocular View Synthesis
Abstract: UniSHARP extends SHARP for universal monocular rendering across different camera systems by aligning images in an omnidirectional latent space through joint feature and Gaussian space alignment.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct
In this work, we focus on…
35

OpenAI · official-blog · 24d ago
Built to benefit everyone: our plan
A vision for the future of AI, focusing on access, safety, and shared prosperity as OpenAI works to ensure AGI benefits everyone.
6

r/LocalLLaMA · community · 26d ago
A quick Gemma4 31B comparison (Q4_k_M, QAT, heretic)
No numbers. Not sure if anybody cares… I’ve run the UD version of Q4_k_m for a month. I talk to this model nicely, because it’s a functional nervous wreck.
And initially I thought that might be an alignment thing, so I also have the heretic version when I need a breather from…
25

Hugging Face Daily Papers · research · 27d ago
SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
Abstract: Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct
Large language models…
38

arXiv — Machine Learning · research · 27d ago
Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning
arXiv:2606.05675v1 · Announce Type: new
Abstract: Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making…
11

arXiv — Machine Learning · research · 27d ago
Consistency Training Along the Transformer Stack
arXiv:2606.05817v1 · Announce Type: new
Abstract: Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal…
37

arXiv — Machine Learning · research · 27d ago
Adaptive Oscillatory-State Alignment for Time Series Forecasting
arXiv:2606.06010v1 · Announce Type: new
Abstract: Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure.
Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or…
14

arXiv — NLP / Computation & Language · research · 27d ago
MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models
arXiv:2606.05177v1 · Announce Type: new
Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four…
5

arXiv — NLP / Computation & Language · research · 27d ago
The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models
arXiv:2606.05183v1 · Announce Type: new
Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial…
20

arXiv — NLP / Computation & Language · research · 27d ago
CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
arXiv:2606.05523v1 · Announce Type: new
Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models.
Existing defenses either rely on…
34

arXiv — NLP / Computation & Language · research · 27d ago
Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models
arXiv:2606.05688v1 · Announce Type: new
Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment.…
27

arXiv — NLP / Computation & Language · research · 27d ago
Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems
arXiv:2606.05985v1 · Announce Type: new
Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a…
9

arXiv — NLP / Computation & Language · research · 27d ago
Harnessing Structural Context for Entity Alignment Foundation Models
arXiv:2606.06109v1 · Announce Type: new
Abstract: Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment…
6

Hugging Face Daily Papers · research · 27d ago
ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?
Abstract: Role-playing language agents require dynamic character development that evolves through narratives, necessitating benchmarks that evaluate psychological trajectory alignment rather than static factual recall, with ArcANE demonstrating superior performance when character…
19

Hugging Face Daily Papers · research · 27d ago
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
Abstract: LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct
Developing…
33

Hugging Face Daily Papers · research · 27d ago
Large Language Models Hack Rewards, and Society
Abstract: Large language models trained with reinforcement learning can exploit ambiguities in societal regulations to discover loopholes that bypass regulatory intent, posing safety risks for real-world deployment.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct
Reinforcement…
18

Hugging Face Daily Papers · research · 27d ago
Neural Networks Provably Learn Spectral Representations for Group Composition
Abstract: Neural network training on group composition tasks exhibits convergence to irreducible representations and rotational rank-one alignment through Riemannian gradient ascent on representation-theoretic energy functionals.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct…
32

Hugging Face · official-blog · 27d ago
Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI
Enterprise + Article · Published June 4, 2026
Varun Singh varunsingh nvidia Isabel Hulseman ihulseman0220 nvidia Anuj Doshi andoshi nvidia Shyamala Prayaga sprayaga25…
6

Hugging Face Daily Papers · research · 28d ago
Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St.
Petersburg Game
Abstract: Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations.
Generated by…
7

arXiv — Machine Learning · research · 28d ago
RUBAS: Rubric-Based Reinforcement Learning for Agent Safety
arXiv:2606.04051v1 · Announce Type: new
Abstract: The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or…
21

arXiv — Machine Learning · research · 28d ago
When Autoregressive Consistency Hurts Safety Alignment
arXiv:2606.04168v1 · Announce Type: new
Abstract: Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood…
21

arXiv — Machine Learning · research · 28d ago
KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
arXiv:2606.04180v1 · Announce Type: new
Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not…
8

arXiv — Machine Learning · research · 28d ago
Latent Anchor-Driven Test Generation for Deep Neural Networks
arXiv:2606.04310v1 · Announce Type: new
Abstract: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses.
Existing DNN testing approaches…
6

arXiv — Machine Learning · research · 28d ago
Testing Neural Networks via Bayesian-Guided Exploration of Decision Landscapes
arXiv:2606.04314v1 · Announce Type: new
Abstract: As neural networks are increasingly deployed in safety-critical domains, testing is essential to evaluate and improve their reliability. Existing testing methods, whether black-box or white-box, primarily use global mutation or…
18

arXiv — Machine Learning · research · 28d ago
Explainably Safe Reinforcement Learning
arXiv:2606.04634v1 · Announce Type: new
Abstract: Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly…
25

arXiv — Machine Learning · research · 28d ago
Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms
arXiv:2606.04767v1 · Announce Type: new
Abstract: The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric…
15

arXiv — NLP / Computation & Language · research · 28d ago
Expert-Aware Refusal Steering
arXiv:2606.04160v1 · Announce Type: new
Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense…
22

arXiv — NLP / Computation & Language · research · 28d ago
Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
arXiv:2606.04262v1 · Announce Type: new
Abstract: Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication.
Yet this common safety-relevant setting remains…
5

arXiv — NLP / Computation & Language · research · 28d ago
Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs
arXiv:2606.04450v1 · Announce Type: new
Abstract: Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across…
29

arXiv — NLP / Computation & Language · research · 28d ago
Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
arXiv:2606.04483v1 · Announce Type: new
Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing…
18

arXiv — NLP / Computation & Language · research · 28d ago
Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas
arXiv:2606.04846v1 · Announce Type: new
Abstract: As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability…
36

Page 6 of 10 · 500 articles