Rebuilding Gemma 4 31b... better... As 26b...
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Sooo... I decided screw it. I'm going to rebuild Gemma 4 31b.
I really like the model. So the current plan is to rebuild the SWA layers.
I'm currently running all the proper ablation tests to figure out which SWA (sliding window attention) layer gets removed. Gemma runs 5 SWA layers at 1024 tokens each, then a global layer. Together those make up one "Block".
Layer 3 is consistently the weakest and will likely get removed.
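As a rough sketch of that kind of ablation sweep: skip one SWA position at a time, measure how much the validation loss rises, and treat the position that hurts least as the removal candidate. The `eval_loss` callback and the numbers in `fake_eval_loss` are hypothetical placeholders for illustration, not measurements from this project.

```python
from typing import Callable, Dict, Iterable, Optional


def rank_swa_positions(
    eval_loss: Callable[[Optional[int]], float],
    positions: Iterable[int] = range(1, 6),
) -> Dict[int, float]:
    """Map each in-block SWA position to the loss increase caused by skipping it.

    The result is ordered from least damaging to most damaging.
    """
    baseline = eval_loss(None)  # None = nothing skipped
    deltas = {p: eval_loss(p) - baseline for p in positions}
    return dict(sorted(deltas.items(), key=lambda kv: kv[1]))


def fake_eval_loss(skip: Optional[int]) -> float:
    """Stand-in evaluator. These values are made up, not real results."""
    made_up = {None: 2.00, 1: 2.40, 2: 2.25, 3: 2.05, 4: 2.20, 5: 2.35}
    return made_up[skip]


ranking = rank_swa_positions(fake_eval_loss)
weakest = next(iter(ranking))  # position whose removal costs the least
```

In a real run, `eval_loss` would wrap a forward pass over a held-out set with the chosen layer bypassed, ideally repeated over several datasets so the "weakest" layer is consistent and not a fluke.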
From there I am going to rescale the SWA windows across the board. The new SWA layers will run at 1024/2048/4096/8.1k tokens, then the global layer. That replaces the "Block" that Gemma uses now.
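To make the new block layout concrete, here is a minimal sketch of the window schedule and the causal sliding-window visibility rule. It assumes "8.1k" means 8192 and that a window of size W covers the current token plus the W-1 tokens before it; both are assumptions, and real implementations differ on the off-by-one.

```python
from typing import List, Optional

# Four SWA layers with growing windows, then one global layer (None = no window).
# 8192 is an assumed value for the post's "8.1k".
NEW_BLOCK: List[Optional[int]] = [1024, 2048, 4096, 8192, None]


def can_attend(query_pos: int, key_pos: int, window: Optional[int]) -> bool:
    """Causal rule: never look ahead; with a window, only see recent tokens."""
    if key_pos > query_pos:
        return False
    return window is None or query_pos - key_pos < window


def visible_span(query_pos: int, window: Optional[int]) -> int:
    """Number of tokens a query at `query_pos` can attend to."""
    if window is None:
        return query_pos + 1
    return min(query_pos + 1, window)


# How much context each layer in the block sees at token 10,000.
spans = [visible_span(10_000, w) for w in NEW_BLOCK]
```

The point of the staggered schedule is that each layer in the block sees roughly twice as far back as the one before it, instead of all local layers sharing the same 1024-token view.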
After that, I'm going to bolt on "Attention based Residual Networks"... Moonshot developed this. The research paper is early 2026 I think. I've barely slept working on this so my date might be wrong on that paper.
Anyways, the global layers in the network are going to get attention based residuals that allow global layers to better flow information across them. In theory this gives the model better global coherence and makes it perform better, while smaller.
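The general idea can be sketched like this: instead of each global layer only receiving a running sum of everything before it, it takes a softmax-weighted mix of the earlier global layers' outputs. This is a generic illustration in plain Python, not the formulation from the Moonshot paper; the single `query` vector (assumed to be learned per layer) and the lack of any projections or normalization are simplifications.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_residual(query: Vector, earlier_outputs: Sequence[Vector]) -> Vector:
    """Mix earlier layer outputs using scaled dot-product weights against `query`."""
    dim = len(query)
    scores = [
        sum(q * h for q, h in zip(query, out)) / math.sqrt(dim)
        for out in earlier_outputs
    ]
    weights = softmax(scores)
    return [
        sum(w * out[i] for w, out in zip(weights, earlier_outputs))
        for i in range(dim)
    ]


# A zero query gives uniform weights, i.e. a plain average of earlier outputs.
mixed = attention_residual([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A standard residual stream is the special case where every earlier output gets the same fixed weight; the attention version lets each global layer choose which earlier global layers to lean on.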
Given that I don't have the complete IT / RL pipeline that Google invests millions in... I have to work from the IT base.
So for the initial rebuild, I'll take the top-K (12? or 20?) logits from the 31b model and use them as targets for retraining, while freezing the top and bottom of the model. This keeps tokenization/output/vocab from moving while the internals of the network find stability in a smaller space that still looks like 31b.
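A minimal sketch of the freezing part, in plain Python. The post does not say exactly which modules count as "top and bottom", so this assumes the embeddings, the output head, and the first and last `edge` decoder layers stay frozen; the parameter names are hypothetical too.

```python
from typing import Dict, List


def trainable_flags(param_names: List[str], num_layers: int, edge: int = 2) -> Dict[str, bool]:
    """Mark which parameters train: only decoder layers away from both ends."""
    flags: Dict[str, bool] = {}
    for name in param_names:
        if name.startswith("layers."):
            index = int(name.split(".")[1])
            flags[name] = edge <= index < num_layers - edge
        else:
            # Embeddings, final norm, output head: frozen so the vocab side cannot drift.
            flags[name] = False
    return flags


# Hypothetical parameter names for a 10-layer toy model.
names = [
    "embed.weight",
    "layers.0.attn.weight",
    "layers.5.mlp.weight",
    "layers.9.attn.weight",
    "lm_head.weight",
]
flags = trainable_flags(names, num_layers=10)
```

In a framework like PyTorch, the same decision would be applied by setting `requires_grad` on each parameter according to the flag.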
The top-K rebuilding is another weird technique I developed in another training spot. It's cool because it teaches the model a vastly richer understanding of what the next token might be, what is adjacent to it, etc... I don't know if I invented the method or just arrived at the same conclusion someone else already had. Probably both.
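One way to write a top-K logit target as a loss, sketched in plain Python for a single token position: keep the teacher's K largest logits, renormalize them into a distribution, and take the cross-entropy of the student against it. Whether the actual pipeline renormalizes the teacher like this, adds a temperature, or mixes in a hard-label term is not stated in the post, so treat those choices as assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def topk_distill_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    k: int = 20,
) -> float:
    """Cross-entropy of the student against the teacher's renormalized top-k."""
    top_ids = sorted(
        range(len(teacher_logits)),
        key=lambda i: teacher_logits[i],
        reverse=True,
    )[:k]
    teacher_logp = log_softmax([teacher_logits[i] for i in top_ids])
    # The student is normalized over the full vocab, so probability mass it puts
    # on tokens outside the teacher's top-k is penalized as well.
    student_logp = log_softmax(student_logits)
    return -sum(math.exp(t) * student_logp[i] for t, i in zip(teacher_logp, top_ids))


teacher = [2.0, 1.0, 0.0, -1.0]
loss_match = topk_distill_loss(teacher, [2.0, 1.0, 0.0, -1.0], k=2)
loss_wrong = topk_distill_loss(teacher, [-1.0, 0.0, 1.0, 2.0], k=2)
```

Compared with training on the single correct next token, the student gets a gradient toward every plausible alternative the teacher ranked highly, which is where the "richer understanding" comes from.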
LASTLY it's feeding it a few billion tokens to rebuild it. I have to find a "good" dataset to use or... literally build the dataset.
The actual full retraining is going to cost money but whatever. I'll hit that wall when I hit it. I'm pretty sure I can just spot price a B300 and train on it.
The model should go from ~30.81B total parameters to ~26.02B.
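For scale, the arithmetic on those two figures, compared against the share of layers removed when one of the six layers in each block is dropped:

```python
before_b = 30.81  # total parameters before, in billions (from the post)
after_b = 26.02   # total parameters after, in billions (from the post)

removed_b = before_b - after_b
reduction = removed_b / before_b

# One of six layers per block removed (5 SWA + 1 global -> 4 SWA + 1 global).
layer_fraction_removed = 1 / 6

print(f"{removed_b:.2f}B removed, {reduction:.1%} smaller")
# prints "4.79B removed, 15.5% smaller"
```

The parameter reduction comes out a bit below the 1/6 of layers removed, which is what you would expect if the embeddings and other non-layer weights stay untouched.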
Theoretically should be BETTER too. Better long context, etc.
If you have good datasets, compute, etc you want to donate... hmu... If you just have questions about how or why this all works... Ask away. I can sit and answer them because staring at a TQDM bar of progress doesn't take a lot of mental effort.
I'll respond after I wake up from the coma I'm about to go into. (Sleep 8 hours+)
Here's the pastebin for the project -- https://pastebin.com/GbVtJQJg
It's the markdown of the whole plan more or less. Start to finish. This is STARTING from the abliterated core. I have zero desire to add censorship of any form into this model during training. If you hurt yourself using a model, it's your fault. I'm likely to rebuild the "thinking" training too, which means uncensoring it, so it stops asking about the "safety" of every request while thinking. This might be easier said than done. Still WIP.