Qwen 3.6 27B AR -> Diffusion - local training on a 5090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Based on the work of Open-dLLM, which achieved an autoregressive -> diffusion realignment head for Qwen 2.5: the same exact model under the hood, delivering a 4x improvement.

TLDR: I haven't got a trained model yet, just a burnt-out GPU cable and a new PSU on order. I did actually get the thing to do a forward pass on a 5090, with a second GPU (an RTX 4000) helping to offload recreations. Below are some low-level ramblings / findings / observations.

Firstly, the amount of VRAM normally required to do this is > 600 GB (I think). After some wrangling, and giving up on the Optane route, it looks possible to train in a QLoRA-style form factor, which takes the model and trains it in NVIDIA's NVFP4 format. I'm attempting to get the entire 27B model to train on a 5090.

https://github.com/scrya-com/dLLM-castlehill

Latest training run: https://wandb.ai/snoozie/open-dllm-27b/runs/arcefpjp?nw=nwusersnoozie

Public service announcement: to avoid burning cables, throttle down the NVIDIA max power limit on consumer 5090 cards from 600 W -> 400 W.

The vanilla Open-dLLM route is validated on Qwen 2.5 with a 4x speed-up (if someone with lots of compute could take a look, it might just work). I deviate from it a bit to explore improving on this, and found a few papers. One is d3LLM (Ultra-Fast Diffusion LLM) https://github.com/hao-ai-lab/d3LLM, which boasts faster diffusion speeds, so I pulled their code into the codebase and included their MDM loss. Seems OK. It basically also takes the order of the tokens into account.

Diffusion can take many steps (see graph), but we can cut that number down to get much higher throughput / tokens per second. If we could theoretically do it in 1 step, you might see some crazy speeds.
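A back-of-envelope sketch of the VRAM numbers above may help. It is not taken from the linked repo. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter for a full fine-tune, and roughly 0.5 bytes per parameter for frozen 4-bit base weights (ignoring the per-block scale overhead of formats like NVFP4). Activations, LoRA adapters and their optimizer state are not counted, so the real totals are higher.

```python
# Rough VRAM arithmetic for a 27B-parameter model (weights and optimizer state only).
GB = 1e9
N_PARAMS = 27e9
RTX_5090_VRAM_GB = 32  # the 5090 ships with 32 GB

def state_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB for n_params at a given per-parameter cost."""
    return n_params * bytes_per_param / GB

# Assumption: full fine-tune with mixed-precision Adam:
# 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master copy + two Adam moments).
FULL_FT_BYTES = 2 + 2 + 12

# Assumption: frozen 4-bit base weights, scale overhead ignored.
FOUR_BIT_BYTES = 0.5

full_ft = state_gb(N_PARAMS, FULL_FT_BYTES)
frozen_4bit = state_gb(N_PARAMS, FOUR_BIT_BYTES)
headroom = RTX_5090_VRAM_GB - frozen_4bit  # left for adapters, activations, KV

print(f"full fine-tune state: {full_ft:.1f} GB")
print(f"frozen 4-bit weights: {frozen_4bit:.1f} GB")
print(f"headroom on a 5090:   {headroom:.1f} GB")
```

Under these assumptions the full fine-tune needs hundreds of GB before any activations are stored, while the frozen 4-bit base fits on a single 5090 with room left for adapters.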
https://wandb.ai/snoozie/open-dllm-compare?nw=nwusersnoozie

When I was working on improving LTX2 to speed up video recreation with 1-shot diffusion, I attempted to implement a trick shot based on a paper, Variational Flow Maps / make some noise. See here: https://github.com/johndpope/ltx2-castlehill https://wandb.ai/snoozie/vfm-v4a?nw=nwusersnoozie

This was built to do 1-step image generation by basically crafting noise that almost looks like the image. In a similar way, this can be done with text to help reduce the number of denoising steps.

VFM: https://github.com/scrya-com/dLLM-castlehill/issues/2 https://github.com/pengzhangzhi/Open-dLLM/issues/31

UPDATE: 1) for Open-dLLM, you have to calculate the anchors from the teacher model (64 layers) from some response, or 2) for d3LLM, we calculate the trajectories and use them for training. There are helper scripts to do both, and an agent (Claude, Grok, or any other) would help with running them.

I'm enjoying opencode.ai - you can get a long way for very little expense. I'm on the $5/mth plan: https://opencode.ai/go?ref=7C4F1XYS01
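To make the step-count and trajectory ideas above concrete, here is a toy sketch. It is not code from Open-dLLM, d3LLM, or the dLLM-castlehill repo. It assumes a masked-diffusion style decoder that starts from an all-mask sequence and reveals the most confident positions on each step. `toy_confidence` and `unmask_in_steps` are hypothetical names, and the "model" is a random stand-in that copies the known target token instead of predicting one.

```python
import random

MASK = "<mask>"

def toy_confidence(position: int, rng: random.Random) -> float:
    """Hypothetical stand-in for a model's confidence at a masked position."""
    return rng.random()

def unmask_in_steps(target, num_steps, seed=0):
    """Reveal `target` from an all-mask sequence in at most `num_steps` passes.

    Returns the final sequence and the trajectory: a list of (step, position)
    pairs in the order the tokens were revealed.
    """
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    trajectory = []
    per_step = -(-len(target) // num_steps)  # ceil: tokens revealed per pass
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Rank still-masked positions by (toy) confidence, reveal the top ones.
        ranked = sorted(masked, key=lambda i: toy_confidence(i, rng), reverse=True)
        for i in ranked[:per_step]:
            seq[i] = target[i]  # a real model would write its own prediction
            trajectory.append((step, i))
    return seq, trajectory

tokens = "the quick brown fox jumps over the lazy".split()
for steps in (8, 4, 1):
    _, traj = unmask_in_steps(tokens, steps)
    passes = len({s for s, _ in traj})
    print(f"{steps} step budget -> {passes} forward passes, "
          f"{len(tokens) / passes:.1f} tokens per pass")
```

Fewer steps mean more tokens revealed per forward pass, which is where the throughput gain would come from. The recorded `(step, position)` list is one plausible reading of a "trajectory" that a student could be trained to follow.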