I extended Gemma4-31B to 44B (88 layers) — since Google won't give us anything bigger than 31B
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've been hanging around this sub for a while now, both as a reader and occasional poster, so I figured it was finally time to share something I've been working on over the last few weekends. Google hasn't shipped a dense Gemma4 bigger than 31B, so I decided to just build one myself.

Heads up though: I'm not a CS or math person, this is all hands-on trial and error on my own hardware. If anything below is theoretically shaky, please tell me, I genuinely want to learn where I'm wrong.

What I did: took Gemma4-31B, expanded it from 60 → 80 layers (identity-init following the LLaMA Pro approach, with a Gemma4-specific […]

My working theory is that Gemma4's dense architecture packs knowledge really compactly, which makes it surprisingly hard to cram in a genuinely new domain without stepping on what's already there. The layer expansion is basically me trying to buy some "empty capacity" for the new domain to live in, rather than fighting the existing weights for space.

Early results for my own legal/STEM use case look promising, though I haven't tested tool calling yet, so I can't speak to that.

Full writeup with the architecture details, identity-init verification, and training verification (I checked whether the duplicated full-attention layer actually trained or stayed dead weight; it did train, and it actually contributed more than the sliding layers) is on the model card:

🔗 https://huggingface.co/TOTORONG/extGemma4-44B

I'd genuinely love to turn this into more of a collaborative effort going forward, especially around the two weakest spots right now: coding ability and tool calling. Concretely, a few things I could use help with: […]

Next up, I'm hoping to try applying this same approach to GLM-5.2 or DeepSeek V4-Flash. MoE architectures are a different beast, so any papers, resources, or hard-won knowledge on MoE-specific expansion (upcycling, expert duplication, routing considerations, whatever) are always welcome.
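For anyone who hasn't seen the identity-init trick before, here is a toy sketch of the idea in plain Python. It is not the code behind this model and it does not include the Gemma4-specific part. The block structure, the names (`make_block`, `expand`, `output_norm`), and the even spacing of one new block per 3 originals (which is what 60 → 80 would give if the new layers were spread uniformly) are all assumptions for illustration. The only thing it tries to show is that a copied residual block with a zeroed output projection leaves the model's output unchanged until training moves those weights.

```python
import copy
import math
import random


def make_block(dim, rng):
    """Toy residual block (hypothetical): x + W_out @ tanh(W_in @ x)."""
    def rand_matrix():
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    return {"w_in": rand_matrix(), "w_out": rand_matrix()}


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def block_forward(block, x):
    hidden = [math.tanh(v) for v in matvec(block["w_in"], x)]
    update = matvec(block["w_out"], hidden)
    # Residual connection: if w_out is all zeros, the block returns x unchanged.
    return [xi + ui for xi, ui in zip(x, update)]


def forward(blocks, x):
    for block in blocks:
        x = block_forward(block, x)
    return x


def expand(blocks, group_size):
    """After every `group_size` original blocks, insert a copy of the last
    block in that group with its output projection zeroed (identity-init)."""
    expanded = []
    for i, block in enumerate(blocks, start=1):
        expanded.append(block)
        if i % group_size == 0:
            new_block = copy.deepcopy(block)
            dim = len(new_block["w_out"])
            new_block["w_out"] = [[0.0] * dim for _ in range(dim)]
            expanded.append(new_block)
    return expanded


def output_norm(block):
    """Frobenius norm of the output projection. It is 0.0 right after
    identity-init; a non-zero value after training means the block moved."""
    return math.sqrt(sum(v * v for row in block["w_out"] for v in row))


rng = random.Random(0)
base = [make_block(4, rng) for _ in range(6)]   # stand-in for the original layers
grown = expand(base, group_size=3)              # 6 -> 8 blocks, same ratio as 60 -> 80
x = [0.5, -1.0, 0.25, 2.0]

print(len(base), "->", len(grown), "blocks")
print("outputs identical:", forward(grown, x) == forward(base, x))
print("new block output norm:", output_norm(grown[3]))
```

In a real transformer the same idea would apply to the attention output projection and the MLP down projection of each new layer. Comparing those weights before and after training is one simple way to tell whether an added layer actually trained or stayed dead weight.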