r/MachineLearning · July 3, 2026 · 1 min read

What does "Safe AI" look like? [D]

For open-weight LLMs, how practical is it to study defenses against post-release fine-tuning that weakens refusal or safety behavior?

I've been seeing “uncensored” or “heretic” variants of new models appear very quickly after release, which raises a question I’m curious about: is fine-tuning resistance a meaningful safety goal for open-weight releases, or is it too narrow because determined users can always modify weights, switch models, or use other workarounds?

More broadly, is current safety training even worth the cost and effort if an automated script can break the model in 30 minutes?

I’m not asking about a specific method, just the threat model. What would count as a useful practical win here? For example, would increasing attacker cost or making safety removal less reliable be valuable, even if perfect prevention is impossible?

Curious how people think about this from a model release, governance, and AI safety perspective.

submitted by /u/Aaron_Rock