What does "Safe AI" look like? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
For open-weight LLMs, how practical is it to study defenses against post-release fine-tuning that weakens refusal or safety behavior?
I've been seeing “uncensored” or “heretic” variants of new models appear very quickly after release, which raises a question I’m curious about: is fine-tuning resistance a meaningful safety goal for open-weight releases, or is it too narrow because determined users can always modify weights, switch models, or use other workarounds?
More broadly, is current safety training even worth the cost and effort if breaking the model takes 30 minutes and an automated script?
I’m not asking about a specific method, just the threat model. What would count as a useful practical win here? For example, would increasing attacker cost or making safety removal less reliable be valuable, even if perfect prevention is impossible?
Curious how people think about this from a model release, governance, and AI safety perspective.