MTP has no impact on my Qwen3.6 MoE performance
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hello I have an rtx 5060Ti and I tried running unsloth's Qwen3.6-35B GGUF with MTP. However in both cases I have around 60 tok/s.
Here are my flags:
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --alias unsloth/Qwen3.6 --port 8002 --kv-unified --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --fit on --no-mmproj --ctx-size 64000 For the MTP variant of course I add the following as per the unsloth guide.
--spec-type draft-mtp --spec-draft-n-max 2 --presence-penalty 1.5
I tried to reduce the ctx size, remove cache quantization, add `--no-mmap` and although the speed changes slightly, it remains the same between MTP/non MTP. I thought it was supposed to offer a speedup.
Anybody has an idea why?
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.