Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I have a 5090 power-limited to 475W. When I run the following command, it barely hits 300W and I get about 30 t/s:
```bash
./llama-server \
  -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  -fit on \
  -c 131072 \
  -fitt 3000 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  -n -1 \
  -fa on \
  --repeat-penalty 1.0
```
But if I remove these two params, it shoots up to 475W and I get 70 t/s:
```bash
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
```
I tried setting spec-draft-n-max to 1, 2, and 4 and got the same results each time. I'm also getting a decent acceptance rate (> 50%).
My test prompt is: "1000 words like Roald Dahl."
What is going on? I swear this was giving me 100+ t/s until 2 days ago. I might have synced llama.cpp to head and re-compiled, but I'm not entirely sure.
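As a rough sanity check on the numbers above, here is a small sketch of what they imply per decode step. It assumes each draft token is accepted independently with probability 0.5 (the lower bound of the "> 50%" figure), which is a simplification, and the helper names `expected_tokens` and `step_ms` are made up for illustration. Under that assumption, a draft length of 2 yields at most 1 + 0.5 + 0.25 = 1.75 tokens per verification step.

```bash
#!/usr/bin/env bash
# Expected tokens emitted per verification step when drafting n tokens,
# ASSUMING each draft token is accepted independently with probability a.
# (Simplification: real acceptance varies by position and by prompt.)
#   E = 1 + a + a^2 + ... + a^n
expected_tokens() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    e = 0; p = 1
    for (i = 0; i <= n; i++) { e += p; p *= a }
    printf "%.2f\n", e
  }'
}

# Implied wall-clock time per decode step, in milliseconds:
#   tokens_per_step / tokens_per_second * 1000
step_ms() {
  awk -v e="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", e / tps * 1000 }'
}

e=$(expected_tokens 0.5 2)   # 50% acceptance, --spec-draft-n-max 2
echo "tokens per step:       $e"
echo "step time with MTP:    $(step_ms "$e" 30) ms"   # ~30 t/s observed
echo "step time without MTP: $(step_ms 1 70) ms"      # ~70 t/s observed
```

If the independence assumption is roughly right, each step with MTP enabled takes about 58 ms versus about 14 ms without it, so the draft-plus-verify step is around four times slower than a plain decode step. That gap, together with the lower power draw, suggests the slowdown is in the speculative path itself rather than in a poor acceptance rate.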