I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Just wanted to share: I was looking for an optimal way to run Ornith 35B in FP8 (E4M3) with MTP on vLLM, but there was no out-of-the-box model with MTP drafter support. So I grafted this new model! It's 18% faster than running without MTP, and the drafter acceptance rate is not bad (70% on average). It should run on any RTX-based setup with more than 80 GB of VRAM at the full 256k context window. It might also do well on unified memory systems like GB10 (for that, use my script and graft the MTP model into a target NVFP4 model!). I work with Hopper and Ada generation hardware, so this is the Pareto-optimal setup for me. Have fun!

Grafter script and vLLM high-performance inference container: https://github.com/kyr0/Ornith-35B-FP8-E4M3-MTP
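For anyone wondering how a 70% acceptance rate relates to throughput, here is a rough back-of-the-envelope sketch using the standard speculative decoding estimate. It rests on assumptions that are not in the post: that the 70% figure is a per-token acceptance probability, that acceptances are independent, and that the number of draft tokens per step (`num_draft_tokens`, a hypothetical parameter here) is known. The function name is made up for illustration and is not part of vLLM or the grafter script.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate` (an idealization), and that a step always yields at
    least one token from the target model itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be >= 0")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the one token from the target.
        return float(num_draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Assumed values for illustration only; the post does not state the draft length.
    for k in (1, 2, 3):
        print(k, round(expected_tokens_per_step(0.7, k), 3))
```

This is an upper bound on the gain, not a prediction: the drafter forward pass and the verification step both cost time, so the wall-clock speedup lands below the tokens-per-step figure.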