Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between harmful and harmless activation caches, then project it out of the weight matrices.

The problem with vanilla abliteration (as popularized by mlabonne) is benchmark degradation. When you project out a component from weight vectors, you shrink their norms. Applied across hundreds of matrices in a 35B-parameter MoE model, the residual stream magnitudes decay layer by layer. The model gets measurably dumber.

grimjim's norm-preserving biprojection technique fixes this. After orthogonalizing each weight row against the refusal direction, you rescale it back to its original L2 norm. The resulting vector has zero component along r and the same magnitude as the original. Simple but it makes the difference between "works on paper" and "actually passes benchmarks."

I applied this to Qwen3.6-35B-A3B (hybrid MoE with 256 experts + shared expert, mixed standard/linear attention). Two things that break naive scripts silently:
Also built an enriched harmful dataset (7356 prompts, 35 categories, 10 prompt styles) because diversity of framing matters more than raw count. If your harmful set is all "how to make a bomb" type prompts, you extract a direction that captures that phrasing pattern, not the actual refusal mechanism.

Results: 0% refusal on held-out test set. Math and code benchmarks intact (the norm preservation is what keeps this working).

Open source:
- Model: Bahushruth/Qwen3.6-35B-A3B-abliterated-v4 (bf16 safetensors)
- GGUF quants: Bahushruth/Qwen3.6-35B-A3B-abliterated-v4-GGUF (Q4_K_M through Q8_0)
- Dataset: Bahushruth/abliteration-harmful-enriched

Full writeup with code, interactive visualizations of the orthogonalization geometry, and layer-wise refusal scores: https://potatospudowski.github.io/articles/abliteration

Key references that shaped this:
- Arditi et al. "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- grimjim "Norm-preserving biprojected abliteration" (2025)
- Pan et al. "The Hidden Dimensions of LLM Alignment" (ICML 2025) - formally proves refusal is multi-dimensional
- Nanfack et al. "Efficient Refusal Ablation through Optimal Transport" (2026) - alternative approach using Gaussian OT

Happy to discuss the MoE-specific challenges or the dataset construction. The einsum thing in particular cost me a few hours of debugging before I realized the expert weights weren't getting modified.
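For readers who want to see the geometry, here is a minimal NumPy sketch of the two steps described in the post: the difference-of-means direction and the row-wise projection with L2 norm restoration. It is not the code from the linked writeup. The function names (`refusal_direction`, `orthogonalize_rows`), the synthetic activations, and the convention that each weight row lives in residual-stream space are all assumptions made for illustration; a real checkpoint may store some matrices transposed, and MoE expert tensors need their own handling.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between two activation caches.

    Both inputs are assumed to have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def orthogonalize_rows(W, r, preserve_norm=True, eps=1e-12):
    """Project r out of every row of W, optionally restoring row norms.

    W is assumed to have shape (n_rows, d_model), so each row is a
    vector in the same space as r (hypothetical layout).
    """
    r = r / np.linalg.norm(r)
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Subtract each row's component along r: w - (w . r) r
    W_proj = W - np.outer(W @ r, r)
    if not preserve_norm:
        return W_proj
    new_norms = np.linalg.norm(W_proj, axis=1, keepdims=True)
    # Rescale each row to its original L2 norm. A row that was
    # (almost) parallel to r has nothing left to rescale, so the
    # eps guard only avoids a division by zero there.
    return W_proj * (orig_norms / np.maximum(new_norms, eps))


# Synthetic stand-ins for real activation caches and a real weight matrix.
rng = np.random.default_rng(0)
d_model = 64
planted = rng.normal(size=d_model)  # planted offset for the toy data
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(128, d_model))

W_plain = orthogonalize_rows(W, r, preserve_norm=False)  # vanilla projection
W_np = orthogonalize_rows(W, r)                          # norm-preserving

print("mean row norm, original:       ", np.linalg.norm(W, axis=1).mean())
print("mean row norm, plain projection:", np.linalg.norm(W_plain, axis=1).mean())
print("mean row norm, norm-preserving: ", np.linalg.norm(W_np, axis=1).mean())
print("max |row . r| after projection: ", np.abs(W_np @ r).max())
```

The plain projection can only shrink a row, which is the norm decay the post describes. The rescaling step is a positive scalar per row, so it restores the magnitude without reintroducing any component along r.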