r/LocalLLaMA

EPYC hybrid system benches and optimal CPU

I've finally built my semi-budget setup, though not everything went as expected. I first bought an EPYC 9555 QS, but I was scammed and the CPU arrived dead. At that point I could only afford a placeholder 9135 with 2 CCDs.

That's why I'm interested in inference numbers from people who bought a proper CPU. Everyone says that 16 CCDs with fewer cores is the best choice (9175F), but based on my research the difference is not that big. On the other hand, I saw a comment from someone who benched GLM-5.2 on a 9684X (CPU only) and got 12 t/s. My setup gets around 7 t/s CPU only. I've also read in some GitHub thread that the 9555 would be better than the 9355.

https://openbenchmarking.org/ only has benchmarks for small models.

My setup:
768 GB DDR5-4800, EPYC 9135, RTX 5090

Test command (ik_llama and Ubergarm/Kimi-K2.6 Q4_X):

    ./llama-sweep-bench \
      --model Ubergarm/Kimi-K2.6-Q4_X-00001-of-00014.gguf \
      --no-mmap --merge-qkv \
      -mla 3 -amb 512 \
      -b 4096 -ub 4096 \
      -ctk f16 -ctv f16 -c 32000 \
      -ngl 999 -ncmoe 999 \
      --threads 16 \
      --threads-batch 28 \
      --warmup-batch \
      -n 128

Numbers: b 4096

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 15.701 | 260.87 | 7.168 | 17.86 |
| 4096 | 128 | 4096 | 16.128 | 253.96 | 7.260 | 17.63 |
| 4096 | 128 | 8192 | 16.296 | 251.35 | 7.457 | 17.16 |
| 4096 | 128 | 16384 | 17.006 | 240.86 | 7.519 | 17.02 |
| 4096 | 128 | 32768 | 18.397 | 222.65 | 7.845 | 16.32 |
| 4096 | 128 | 65536 | 20.240 | 202.37 | 8.298 | 15.43 |
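For anyone comparing against their own runs: the two speed columns appear to be just tokens divided by seconds (S_PP = PP / T_PP, S_TG = TG / T_TG). A minimal Python sketch that recomputes them from the rows above; the tuple layout and the `throughput` helper are my own naming, not anything from ik_llama, and small differences in the last digit come from the times being rounded to 3 decimals.

```python
# Rows copied from the "b 4096" table: (PP, TG, N_KV, T_PP seconds, T_TG seconds)
rows = [
    (4096, 128, 0, 15.701, 7.168),
    (4096, 128, 4096, 16.128, 7.260),
    (4096, 128, 8192, 16.296, 7.457),
    (4096, 128, 16384, 17.006, 7.519),
    (4096, 128, 32768, 18.397, 7.845),
    (4096, 128, 65536, 20.240, 8.298),
]


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one benchmark step (hypothetical helper)."""
    return tokens / seconds


# (N_KV, recomputed S_PP, recomputed S_TG) for every row
derived = [
    (n_kv, throughput(pp, t_pp), throughput(tg, t_tg))
    for pp, tg, n_kv, t_pp, t_tg in rows
]

for n_kv, s_pp, s_tg in derived:
    print(f"N_KV={n_kv:>6}  S_PP={s_pp:7.2f} t/s  S_TG={s_tg:5.2f} t/s")
```

The same loop works for the other tables if you swap in their rows.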

Numbers: b 8192

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 8192 | 128 | 0 | 18.564 | 441.28 | 7.081 | 18.08 |
| 8192 | 128 | 8192 | 20.323 | 403.10 | 7.405 | 17.29 |
| 8192 | 128 | 16384 | 21.115 | 387.96 | 7.525 | 17.01 |

Previous 4090 numbers:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 19.716 | 207.75 | 7.269 | 17.61 |
| 4096 | 128 | 4096 | 20.324 | 201.54 | 7.379 | 17.35 |
| 4096 | 128 | 8192 | 20.717 | 197.71 | 7.512 | 17.04 |

I've also found numbers for DDR5-6400 and an EPYC 9355:

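| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 14.985 | 273.35 | 6.326 | 20.24 |
| 4096 | 128 | 4096 | 15.316 | 267.44 | 6.453 | 19.83 |
| 4096 | 128 | 8192 | 15.662 | 261.52 | 6.614 | 19.35 |
| 4096 | 128 | 16384 | 16.399 | 249.77 | 6.719 | 19.05 |
| 4096 | 128 | 32768 | 17.656 | 231.98 | 6.989 | 18.31 |
| 4096 | 128 | 65536 | 20.666 | 198.20 | 8.107 | 15.79 |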
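To put the 9355 numbers next to my 9135 + 5090 run, here is a rough Python sketch that divides the S_TG columns row by row and compares the result with the memory speed ratio (6400 / 4800). Treat it as a ballpark only: the GPU behind the 9355 numbers isn't stated, and the dictionary names are just mine.

```python
# S_TG (t/s) by N_KV, copied from the two "b 4096" tables in this post
tg_9135_ddr5_4800 = {0: 17.86, 4096: 17.63, 8192: 17.16,
                     16384: 17.02, 32768: 16.32, 65536: 15.43}
tg_9355_ddr5_6400 = {0: 20.24, 4096: 19.83, 8192: 19.35,
                     16384: 19.05, 32768: 18.31, 65536: 15.79}

# Ratio of the memory transfer rates alone (MT/s); says nothing about
# channel count or CCD limits, which the post does not give for the 9355 box
mem_ratio = 6400 / 4800

# TG speedup of the 9355 system over the 9135 system at each context depth
speedup = {
    n_kv: tg_9355_ddr5_6400[n_kv] / tg_9135_ddr5_4800[n_kv]
    for n_kv in tg_9135_ddr5_4800
}

print(f"memory speed ratio: {mem_ratio:.3f}x")
for n_kv, ratio in speedup.items():
    print(f"N_KV={n_kv:>6}  TG speedup: {ratio:.3f}x")
```

In these numbers the TG gain stays well under the memory speed ratio at every depth and nearly vanishes at N_KV 65536, which fits my impression that the difference between CPUs is not that big.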

Another setup with the same ik_llama and Kimi-K2.6 Q4_X, an EPYC 9175F and RTX 6000 Pro, reports TG in the 17.9 to 21 t/s range and cold PP in the 223 to 377 t/s range.

submitted by /u/iVoider