r/LocalLLaMA

What's the best tps I can get with Qwen3.5 122B on 32GB VRAM + 64GB RAM?

My attempt at running Qwen3.5 122B on my 5090 (32GB VRAM) + 64GB RAM is pretty bleak. Generation starts at about 6 tps and only reaches ~20 tps (the tg_3s figure in the log) near the end of the response. Can I improve this further?

build/bin/llama-server \
  -m ~/myp/models/unsloth/qwen3.5/Q5_K_S/Qwen3.5-122B-A10B-Q5_K_S-00001-of-00003.gguf \
  --temp 0.6 \
  --top_p 0.95 \
  --top_k 20 \
  --min_p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  -c 100000 \
  -t 16 \
  -ngl 99 \
  --flash-attn on \
  --host 0.0.0.0 --port 8080 \
  --no-mmproj --parallel 1 --chat-template-kwargs '{"enable_thinking": true}' -ncmoe 35

0.30.172.197 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.31.613.986 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 32 (pos_min = 6, pos_max = 6, n_tokens = 7, size = 149.063 MiB)
0.48.033.184 I slot print_timing: id 0 | task 0 | n_decoded = 100, tg = 6.21 t/s, tg_3s = 6.21 t/s
0.51.174.776 I slot print_timing: id 0 | task 0 | n_decoded = 120, tg = 6.24 t/s, tg_3s = 6.37 t/s
0.54.338.404 I slot print_timing: id 0 | task 0 | n_decoded = 143, tg = 6.38 t/s, tg_3s = 7.27 t/s
0.57.430.775 I slot print_timing: id 0 | task 0 | n_decoded = 172, tg = 6.75 t/s, tg_3s = 9.38 t/s
1.00.583.009 I slot print_timing: id 0 | task 0 | n_decoded = 204, tg = 7.12 t/s, tg_3s = 10.15 t/s
1.03.616.932 I slot print_timing: id 0 | task 0 | n_decoded = 235, tg = 7.42 t/s, tg_3s = 10.22 t/s
1.06.667.693 I slot print_timing: id 0 | task 0 | n_decoded = 268, tg = 7.72 t/s, tg_3s = 10.82 t/s
1.09.733.669 I slot print_timing: id 0 | task 0 | n_decoded = 302, tg = 7.99 t/s, tg_3s = 11.09 t/s
1.12.753.794 I slot print_timing: id 0 | task 0 | n_decoded = 343, tg = 8.40 t/s, tg_3s = 13.58 t/s
1.15.796.782 I slot print_timing: id 0 | task 0 | n_decoded = 386, tg = 8.80 t/s, tg_3s = 14.13 t/s
1.18.826.330 I slot print_timing: id 0 | task 0 | n_decoded = 439, tg = 9.36 t/s, tg_3s = 17.49 t/s
1.21.873.427 I slot print_timing: id 0 | task 0 | n_decoded = 491, tg = 9.83 t/s, tg_3s = 17.07 t/s
1.24.890.649 I slot print_timing: id 0 | task 0 | n_decoded = 550, tg = 10.39 t/s, tg_3s = 19.55 t/s
1.27.892.235 I slot print_timing: id 0 | task 0 | n_decoded = 609, tg = 10.88 t/s, tg_3s = 19.66 t/s
1.30.903.263 I slot print_timing: id 0 | task 0 | n_decoded = 668, tg = 11.33 t/s, tg_3s = 19.59 t/s
1.34.030.391 I slot print_timing: id 0 | task 0 | n_decoded = 729, tg = 11.74 t/s, tg_3s = 19.51 t/s
1.37.055.301 I slot print_timing: id 0 | task 0 | n_decoded = 792, tg = 12.16 t/s, tg_3s = 20.83 t/s
1.39.106.530 I reasoning-budget: deactivated (natural end)
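For anyone comparing numbers: in this log, tg looks like the running average since generation started, while tg_3s looks like the rate over the most recent ~3 second window. Below is a rough Python sketch for checking that by recomputing the per-interval rate from the timestamps and n_decoded values. The function names (parse_timings, interval_rates) are made up for illustration, the regex is based only on the lines in this post, and reading the timestamp as minutes.seconds.millis.micros is an assumption (0.57.xxx is followed by 1.00.xxx).

```python
import re

# Assumed line shape, taken only from the print_timing lines in this post.
# The timestamp is read as minutes.seconds.millis.micros (an assumption).
TIMING_RE = re.compile(
    r"(\d+)\.(\d+)\.(\d+)\.(\d+) I slot print_timing:.*?"
    r"n_decoded = (\d+), tg = ([\d.]+) t/s, tg_3s = ([\d.]+) t/s"
)


def parse_timings(log_text):
    """Return a list of (seconds, n_decoded, tg, tg_3s) tuples."""
    samples = []
    for m in TIMING_RE.finditer(log_text):
        minutes, secs, millis, micros = (int(m.group(i)) for i in range(1, 5))
        t = minutes * 60 + secs + millis / 1e3 + micros / 1e6
        samples.append((t, int(m.group(5)), float(m.group(6)), float(m.group(7))))
    return samples


def interval_rates(samples):
    """Tokens per second between each pair of consecutive samples."""
    return [
        (n1 - n0) / (t1 - t0)
        for (t0, n0, _, _), (t1, n1, _, _) in zip(samples, samples[1:])
    ]


if __name__ == "__main__":
    # Two lines copied from the log in this post.
    log = (
        "0.48.033.184 I slot print_timing: id 0 | task 0 | "
        "n_decoded = 100, tg = 6.21 t/s, tg_3s = 6.21 t/s\n"
        "0.51.174.776 I slot print_timing: id 0 | task 0 | "
        "n_decoded = 120, tg = 6.24 t/s, tg_3s = 6.37 t/s\n"
    )
    samples = parse_timings(log)
    for (_, n, tg, tg_3s), rate in zip(samples[1:], interval_rates(samples)):
        print(f"n_decoded={n} logged tg={tg} tg_3s={tg_3s} recomputed={rate:.2f}")
```

On the two sample lines, the recomputed rate is 20 tokens over about 3.14 s, roughly 6.37 t/s, which matches the logged tg_3s rather than tg. So the ~20 tps at the end is the recent-window speed, and the average over the whole response is closer to 12 tps.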

submitted by /u/BitGreen1270