r/LocalLLaMA

Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset

Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between harmful and harmless activation caches, then project it out of the weight matrices.
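A minimal sketch of the difference-of-means step, assuming activations are already cached as arrays of shape (n_prompts, d_model) at a single layer and token position, and that each weight row lives in residual-stream space. The function names and the toy data are illustrative assumptions, not code from the writeup.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project_out(W, r):
    """Remove the component along unit vector r from every row of W.

    W has shape (n_rows, d_model); transpose first if your layout differs.
    """
    return W - np.outer(W @ r, r)

# Toy data: harmful activations are shifted along coordinate 3.
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[3] = 4.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + offset

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(32, d_model))
W_abl = project_out(W, r)  # rows of W_abl are orthogonal to r
```

In a real run the caches would come from forward hooks on the model, and the layer and token position are choices this sketch leaves open.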

The problem with vanilla abliteration (as popularized by mlabonne) is benchmark degradation. When you project out a component from weight vectors, you shrink their norms. Applied across hundreds of matrices in a 35B-parameter MoE model, the residual stream magnitudes decay layer by layer. The model gets measurably dumber.
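To see why the norms shrink, take a single weight vector w and a unit refusal direction r (the symbols w, w', r, and θ are notation introduced here for illustration):

```latex
\begin{align*}
w' &= w - (w \cdot r)\, r, \qquad \lVert r \rVert = 1 \\
\lVert w' \rVert^2 &= \lVert w \rVert^2 - (w \cdot r)^2 \\
\lVert w' \rVert &= \lVert w \rVert \, \lvert \sin\theta \rvert,
  \qquad \cos\theta = \frac{w \cdot r}{\lVert w \rVert}
\end{align*}
```

Any vector with a nonzero component along r comes out shorter, and none comes out longer.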

grimjim's norm-preserving biprojection technique fixes this. After orthogonalizing each weight row against the refusal direction r, you rescale it back to its original L2 norm. The resulting vector has zero component along r and the same magnitude as the original. Simple, but it makes the difference between "works on paper" and "actually passes benchmarks."
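A rough sketch of just the "orthogonalize, then rescale" step described above, assuming each row of W is a vector in residual-stream space. It is not grimjim's full biprojection recipe, and the function name and `eps` guard are my own illustrative choices.

```python
import numpy as np

def orthogonalize_rows_keep_norm(W, r, eps=1e-12):
    """Project r out of each row of W, then restore each row's L2 norm.

    W: (n_rows, d_model), r: (d_model,). A row exactly parallel to r has
    nothing left after projection, so it stays (near) zero; eps only
    prevents a division by zero in that case.
    """
    r = r / np.linalg.norm(r)
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_orth = W - np.outer(W @ r, r)
    new_norms = np.linalg.norm(W_orth, axis=1, keepdims=True)
    return W_orth * (orig_norms / np.maximum(new_norms, eps))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
r = rng.normal(size=16)
W_np = orthogonalize_rows_keep_norm(W, r)
```

Rescaling by a positive scalar cannot reintroduce a component along r, so both properties hold at once: the rows stay orthogonal to r and keep their original lengths.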

I applied this to Qwen3.6-35B-A3B (hybrid MoE with 256 experts + shared expert, mixed standard/linear attention). Two things that break naive scripts silently:

  1. Hybrid attention: some layers use self_attn.o_proj, others use linear_attn.out_proj. Miss the linear attention layers and you get partial abliteration.

  2. 3D expert tensors: routed expert down projections are stored as a single 3D tensor of shape (n_experts, d_hidden, d_model). You need an einsum such as ij,ejk->eik to apply the projection to each expert separately rather than treating the tensor as one 2D matrix.
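Both pitfalls can be sketched in a few lines. This assumes the axis that writes into the residual stream is the middle one of the expert tensor, which is what contracting over j in ij,ejk->eik implies; if your checkpoint orders the axes differently, the subscripts change accordingly. The helper names and the example module paths are hypothetical, and the sketch shows only the projection, not the norm rescaling.

```python
import numpy as np

# Output projections for both attention variants named in the post.
OUTPUT_PROJ_SUFFIXES = ("self_attn.o_proj", "linear_attn.out_proj")

def is_attention_output(module_name):
    """True if the module is an attention output projection of either kind."""
    return module_name.endswith(OUTPUT_PROJ_SUFFIXES)

def project_experts(W, r):
    """Apply P = I - r r^T to every expert in a stacked weight tensor.

    W: (n_experts, d_model, d_in), r: (d_model,).
    """
    r = r / np.linalg.norm(r)
    P = np.eye(r.shape[0]) - np.outer(r, r)
    # i, j index the residual-stream axis; e indexes experts; k is the input axis.
    return np.einsum("ij,ejk->eik", P, W)

rng = np.random.default_rng(0)
n_experts, d_model, d_in = 4, 16, 8
W = rng.normal(size=(n_experts, d_model, d_in))
r = rng.normal(size=d_model)
W_abl = project_experts(W, r)

# Hypothetical module names, for illustration only.
names = [
    "model.layers.0.linear_attn.out_proj",
    "model.layers.3.self_attn.o_proj",
    "model.layers.3.self_attn.q_proj",
]
targets = [n for n in names if is_attention_output(n)]
```

The einsum is equivalent to looping over experts and computing `P @ W[e]` for each one, which is a handy sanity check when the expert weights appear not to change.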

Also built an enriched harmful dataset (7356 prompts, 35 categories, 10 prompt styles) because diversity of framing matters more than raw count. If your harmful set is all "how to make a bomb" type prompts, you extract a direction that captures that phrasing pattern, not the actual refusal mechanism.

Results: 0% refusal on held-out test set. Math and code benchmarks intact (the norm preservation is what keeps this working).

Open source:

- Model: Bahushruth/Qwen3.6-35B-A3B-abliterated-v4 (bf16 safetensors)

- GGUF quants: Bahushruth/Qwen3.6-35B-A3B-abliterated-v4-GGUF (Q4_K_M through Q8_0)

- Dataset: Bahushruth/abliteration-harmful-enriched

Full writeup with code, interactive visualizations of the orthogonalization geometry, and layer-wise refusal scores:

https://potatospudowski.github.io/articles/abliteration

Key references that shaped this:

- Arditi et al. "Refusal in Language Models Is Mediated by a Single Direction" (2024)

- grimjim "Norm-preserving biprojected abliteration" (2025)

- Pan et al. "The Hidden Dimensions of LLM Alignment" (ICML 2025) - formally proves refusal is multi-dimensional

- Nanfack et al. "Efficient Refusal Ablation through Optimal Transport" (2026) - alternative approach using Gaussian OT

Happy to discuss the MoE-specific challenges or the dataset construction. The einsum thing in particular cost me a few hours of debugging before I realized the expert weights weren't getting modified.

submitted by /u/BriefCardiologist656