EPYC hybrid system benches and optimal CPU
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've finally built my semi-budget setup, though not everything went as expected. First, I bought an EPYC 9555 QS, but I got scammed and the CPU arrived dead. At that point I could only afford a placeholder 9135 with 2 CCDs.
That's why I'm interested in inference numbers from people who bought a proper CPU. Everyone says that 16 CCDs with fewer cores (9175F) is the best choice, but based on my research the difference is not that big. On the other hand, I saw a comment from someone who benched GLM-5.2 on a 9684X (CPU only) and got 12 t/s, while my setup gets around 7 t/s CPU-only. I've also read in some GitHub thread that the 9555 would be better than the 9355.
https://openbenchmarking.org/ only has benchmarks for small models.
My setup:
768 GB DDR5-4800, EPYC 9135, RTX 5090
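For context on the RAM speed, here is a rough sketch of the theoretical peak memory bandwidth for DDR5-4800 (this setup) and DDR5-6400 (the 9355 numbers further down). It assumes all 12 memory channels of the SP5 platform are populated and a 64-bit (8-byte) data bus per channel; the post does not state the DIMM layout, so treat `channels=12` as an assumption. The function name is made up for illustration. A 2-CCD part may not actually reach this ceiling, which is the whole point of the CCD question.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int = 12, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal units).

    channels=12 is an assumption (fully populated SP5 board), not something
    stated in the post.
    """
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


for speed in (4800, 6400):
    print(f"DDR5-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s peak (12 channels assumed)")
```

These are upper bounds from the spec sheet arithmetic only; measured bandwidth will be lower.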
Test command (ik_llama and Ubergarm/Kimi-K2.6 Q4_X):
./llama-sweep-bench \
--model Ubergarm/Kimi-K2.6-Q4_X-00001-of-00014.gguf \
--no-mmap --merge-qkv \
-mla 3 -amb 512 \
-b 4096 -ub 4096 \
-ctk f16 -ctv f16 -c 32000 \
-ngl 999 -ncmoe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
Numbers with -b 4096:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 15.701 | 260.87 | 7.168 | 17.86 |
| 4096 | 128 | 4096 | 16.128 | 253.96 | 7.260 | 17.63 |
| 4096 | 128 | 8192 | 16.296 | 251.35 | 7.457 | 17.16 |
| 4096 | 128 | 16384 | 17.006 | 240.86 | 7.519 | 17.02 |
| 4096 | 128 | 32768 | 18.397 | 222.65 | 7.845 | 16.32 |
| 4096 | 128 | 65536 | 20.240 | 202.37 | 8.298 | 15.43 |
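For anyone reading the table for the first time: the speed columns are just tokens divided by seconds (S_PP = PP / T_PP, S_TG = TG / T_TG). The sketch below recomputes them from the rows above and shows how much speed is lost between an empty context and N_KV = 65536. The helper names are made up for illustration; the numbers are copied from the table.

```python
# (N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s), copied from the -b 4096 table above
ROWS = [
    (0,     15.701, 260.87, 7.168, 17.86),
    (4096,  16.128, 253.96, 7.260, 17.63),
    (8192,  16.296, 251.35, 7.457, 17.16),
    (16384, 17.006, 240.86, 7.519, 17.02),
    (32768, 18.397, 222.65, 7.845, 16.32),
    (65536, 20.240, 202.37, 8.298, 15.43),
]
PP, TG = 4096, 128  # tokens per prompt batch / per generation run


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second."""
    return tokens / seconds


def drop_pct(first: float, last: float) -> float:
    """Relative slowdown from `first` to `last`, in percent."""
    return (first - last) / first * 100


for n_kv, t_pp, s_pp, t_tg, s_tg in ROWS:
    print(f"N_KV={n_kv:>5}  PP {throughput(PP, t_pp):7.2f} t/s  TG {throughput(TG, t_tg):5.2f} t/s")

print(f"PP drop 0 -> 65536: {drop_pct(ROWS[0][2], ROWS[-1][2]):.1f}%")
print(f"TG drop 0 -> 65536: {drop_pct(ROWS[0][4], ROWS[-1][4]):.1f}%")
```

The recomputed values match the reported columns to within rounding.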
Numbers with -b 8192:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 8192 | 128 | 0 | 18.564 | 441.28 | 7.081 | 18.08 |
| 8192 | 128 | 8192 | 20.323 | 403.10 | 7.405 | 17.29 |
| 8192 | 128 | 16384 | 21.115 | 387.96 | 7.525 | 17.01 |
Previous 4090 numbers:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 128 | 0 | 19.716 | 207.75 | 7.269 | 17.61 |
| 4096 | 128 | 4096 | 20.324 | 201.54 | 7.379 | 17.35 |
| 4096 | 128 | 8192 | 20.717 | 197.71 | 7.512 | 17.04 |
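To put the GPU swap in numbers, here is a small sketch comparing the 5090 and 4090 rows at the same N_KV. It assumes the 4090 run used the same command and the same CPU/RAM, which the post implies but does not spell out. `gain_pct` is a made-up helper name.

```python
# N_KV -> (S_PP t/s, S_TG t/s), copied from the -b 4096 tables above
RTX_5090 = {0: (260.87, 17.86), 4096: (253.96, 17.63), 8192: (251.35, 17.16)}
RTX_4090 = {0: (207.75, 17.61), 4096: (201.54, 17.35), 8192: (197.71, 17.04)}


def gain_pct(new: float, old: float) -> float:
    """Relative speedup of `new` over `old`, in percent."""
    return (new / old - 1) * 100


for n_kv in sorted(RTX_5090):
    pp_new, tg_new = RTX_5090[n_kv]
    pp_old, tg_old = RTX_4090[n_kv]
    print(f"N_KV={n_kv:>4}  PP {gain_pct(pp_new, pp_old):+5.1f}%  TG {gain_pct(tg_new, tg_old):+4.1f}%")
```

On these three rows prompt processing improves by roughly a quarter while token generation barely moves, which would fit TG being limited by the CPU/RAM side rather than the GPU in this configuration.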
I've also found numbers for DDR5-6400 and an EPYC 9355:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 14.985 | 273.35 | 6.326 | 20.24 |
| 4096 | 128 | 4096 | 15.316 | 267.44 | 6.453 | 19.83 |
| 4096 | 128 | 8192 | 15.662 | 261.52 | 6.614 | 19.35 |
| 4096 | 128 | 16384 | 16.399 | 249.77 | 6.719 | 19.05 |
| 4096 | 128 | 32768 | 17.656 | 231.98 | 6.989 | 18.31 |
| 4096 | 128 | 65536 | 20.666 | 198.20 | 8.107 | 15.79 |
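A quick way to relate these numbers to mine is to compare the TG speedup against the nominal RAM speed ratio (6400 / 4800). This is only a rough sketch: the GPU in the 9355 system is not stated here, and the two systems differ in both CPU and RAM speed, so the split between the two cannot be read off from this. `speedup` is a made-up helper name.

```python
# N_KV -> S_TG t/s, copied from the two -b 4096 tables in this post
TG_9135_DDR5_4800 = {0: 17.86, 4096: 17.63, 8192: 17.16, 16384: 17.02, 32768: 16.32, 65536: 15.43}
TG_9355_DDR5_6400 = {0: 20.24, 4096: 19.83, 8192: 19.35, 16384: 19.05, 32768: 18.31, 65536: 15.79}

DDR_RATIO = 6400 / 4800  # nominal transfer-rate ratio, about 1.33


def speedup(n_kv: int) -> float:
    """TG speedup of the 9355 system over the 9135 system at a given N_KV."""
    return TG_9355_DDR5_6400[n_kv] / TG_9135_DDR5_4800[n_kv]


for n_kv in sorted(TG_9135_DDR5_4800):
    print(f"N_KV={n_kv:>5}  TG speedup x{speedup(n_kv):.3f}  (RAM speed ratio x{DDR_RATIO:.3f})")
```

At every context length in the tables the TG gain stays below the nominal RAM speed ratio, and at N_KV = 65536 the two systems are almost level.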
Another setup running the same ik_llama and Kimi-K2.6 Q4_X, with an EPYC 9175F and RTX 6000 Pro, reported TG in the 17.9 to 21 t/s range and cold PP in the 223 to 377 t/s range.