I extended Gemma4-31B to 44B (88 layers) — since Google won't give us anything bigger than 31B

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I've been hanging around here for a while now, both as a reader and occasional poster, so I figured it was finally time to share something I've been working on over the last few weekends.

Google hasn't shipped a dense Gemma4 bigger than 31B, so I decided to just build one myself. Heads up though — I'm not a CS or math person, this is all hands-on trial and error on my own hardware. If anything below is theoretically shaky, please tell me, I genuinely want to learn where I'm wrong.

What I did: I took Gemma4-31B and expanded it from 60 → 80 layers (identity-init following the LLaMA Pro approach, with a Gemma4-specific layer_scalar fix that took me way too long to track down), then fine-tuned it on Korean legal + STEM data. After that I did a second round of expansion by block duplication (80 → 88 layers, ~44B params), applied on top of the already fine-tuned model instead of the base.
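For anyone unfamiliar with identity-init expansion, here is a minimal toy sketch of the general idea, not the actual code behind this model. It assumes a simplified residual block with a single output projection (in a real transformer, LLaMA Pro-style expansion zeroes both the attention output projection and the MLP down projection of each new block). The `ToyBlock` class, the `expand_identity` helper, and the choice of one new block after every 3 originals are all hypothetical, picked only so that 60 blocks become 80. It does not model Gemma4's layer_scalar or the fix mentioned above.

```python
import copy
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyBlock:
    """Toy residual block: x + W_out @ tanh(W_in @ x). Hypothetical stand-in for a decoder layer."""

    def __init__(self, w_in, w_out):
        self.w_in = w_in
        self.w_out = w_out

    def __call__(self, x):
        hidden = [math.tanh(v) for v in matvec(self.w_in, x)]
        delta = matvec(self.w_out, hidden)
        return [xi + di for xi, di in zip(x, delta)]


def expand_identity(blocks, group_size):
    """Insert one identity-initialized copy after every `group_size` blocks."""
    expanded = []
    for index, block in enumerate(blocks, start=1):
        expanded.append(block)
        if index % group_size == 0:
            new_block = copy.deepcopy(block)
            # Zero the output projection so the new block adds nothing
            # to the residual stream until it is trained.
            new_block.w_out = [[0.0] * len(row) for row in new_block.w_out]
            expanded.append(new_block)
    return expanded


def forward(blocks, x):
    """Run the input through every block in order."""
    for block in blocks:
        x = block(x)
    return x


rng = random.Random(0)


def random_matrix(size):
    return [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(size)]


base = [ToyBlock(random_matrix(4), random_matrix(4)) for _ in range(60)]
# Assumed spacing: one new block per 3 originals gives 60 + 20 = 80 blocks.
expanded = expand_identity(base, group_size=3)

x = [0.1, -0.2, 0.3, 0.4]
print(len(base), "->", len(expanded))
print("outputs match:", forward(base, x) == forward(expanded, x))
```

The point of the check at the end is that the expanded stack computes the same function as the original before any fine-tuning, so training starts from the base model's behavior rather than from a damaged one.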

My working theory is that Gemma4's dense architecture packs knowledge really compactly, which makes it surprisingly hard to cram in a genuinely new domain without stepping on what's already there. The layer expansion is basically me trying to buy some "empty capacity" for the new domain to live in, rather than fighting the existing weights for space. Early results for my own legal/STEM use case look promising, though I haven't tested tool calling yet so I can't speak to that.

I also checked whether the duplicated full-attention layer actually trained or just stayed dead weight: it did train, and it contributed more than the sliding layers. The full writeup, with the architecture details, identity-init verification, and that training verification, is on the model card:

🔗 https://huggingface.co/TOTORONG/extGemma4-44B
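One simple way to run a "did the new layer train or stay dead weight" check is to compare each layer's weights before and after fine-tuning. The sketch below is an assumption about how such a check could look; the model card's actual method may differ. The tensor names, the flat-list weight format, and the `tol` threshold are all hypothetical.

```python
import math


def update_norm(before, after):
    """L2 norm of the weight change between two flat weight lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))


def find_dead_layers(before_ckpt, after_ckpt, tol=1e-6):
    """Return names of tensors whose weights barely moved during training."""
    return [
        name
        for name in sorted(before_ckpt)
        if update_norm(before_ckpt[name], after_ckpt[name]) <= tol
    ]


# Hypothetical toy checkpoints: name -> flattened weights.
before_ckpt = {
    "layer_a.o_proj": [0.0, 0.0, 0.0],   # identity-init (all zeros) before training
    "layer_b.o_proj": [0.2, -0.1, 0.4],
    "layer_c.o_proj": [0.0, 0.0, 0.0],   # identity-init (all zeros) before training
}
after_ckpt = {
    "layer_a.o_proj": [0.03, -0.04, 0.0],  # moved away from zero: it trained
    "layer_b.o_proj": [0.2, -0.1, 0.4],    # unchanged
    "layer_c.o_proj": [0.0, 0.0, 0.0],     # still all zeros: dead weight
}

norms = {
    name: update_norm(before_ckpt[name], after_ckpt[name])
    for name in before_ckpt
}
dead = find_dead_layers(before_ckpt, after_ckpt)
print("update norms:", norms)
print("dead layers:", dead)
```

The absolute update norm is used instead of a relative one because identity-initialized projections start at exactly zero, which would make a relative change undefined. A weight-delta check like this only shows that a layer moved; measuring how much it contributes would need something like an ablation on held-out data.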

I'd genuinely love to turn this into more of a collaborative effort going forward, especially around the two weakest spots right now: coding ability and tool-calling. Concretely, a few things I could use help with —

  • CoT datasets geared toward coding and tool-use/function-calling, ideally ones that generalize rather than just memorize a fixed toolset
  • Anyone willing to actually stress-test tool calling on this model and report back, since I haven't gotten to that myself yet
  • Feedback on whether it's worth pushing this expansion further (96–100 layers is on my mind) versus focusing purely on data/training quality at 88 layers
  • If anyone's tried similar block-duplication or layer-insertion expansions on other dense architectures, I'd love to compare notes on what worked and what didn't

Next up, I'm hoping to try applying this same approach to GLM-5.2 or DeepSeek V4-Flash — MoE architectures are a different beast, so any papers, resources, or hard-won knowledge on MoE-specific expansion (upcycling, expert duplication, routing considerations, whatever) are always welcome.

submitted by /u/Desperate-Sir-5088