r/LocalLLaMA · · 2 min read

6x P40 running Minimax M2.7_Q3_XL

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

6x P40 running Minimax M2.7_Q3_XL

I've been a lurker for a while and have been building my own home lab with P40's and MI50's. I've learned so much from the community and I just felt like it's time to give back. Even though I'm still learning I'm sure this information will be valuable to someone out there. I'll be posting MI50's details once I'm done fine tuning my P40 box.

Hardware:

Asus X99-E-WS (Modded BIOS to support a large number GPU's )
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
128GB DDR4 RAM (mixed batch of Non-ECC sticks)
SSD
6x P40's 144GB VRAM (Gen3 x8,x8,x8,x8,x8,x8)

Memory distribution during benchmark

The below table shows benchmarks I ran with my findings:

Test configuration Context pp512 tg128 pp512+tg128 pp4096+tg128 Result
F16 KV, FA on, batch 2048, ubatch 512 32,768 73.20 10.45 33.50 129.51 Original baseline
F16 KV, FA on, batch 2048, ubatch 512 65,536 42.68 6.43 19.49 77.22 Original baseline
F16 KV, FA on, batch 2048, ubatch 512 126,720 24.16 3.51 10.90 44.22 Fits
Q8 KV, FA on, batch 2048, ubatch 512 65,536 42.53 6.14 Slower than F16
Q8 KV, FA on, batch 2048, ubatch 512 126,720 23.91 3.06 Generation −12.8%
F16 KV, FA on, batch 1024, ubatch 256 32,768 105.76 10.70 37.34 128.94 Strong improvement
F16 KV, FA on, batch 1024, ubatch 256 65,536 66.00 6.18 22.63 79.39 Strong improvement
F16 KV, FA on, batch 2048, ubatch 256 32,768 105.91 10.50 37.41 129.42 Selected
F16 KV, FA on, batch 2048, ubatch 256 65,536 65.86 6.38 22.63 79.37 Selected
F16 KV, FA off, batch 1024, ubatch 256 32,768 34.16 2.72 Major regression
F16 KV, FA off, batch 1024, ubatch 256 65,536 19.34 1.50 Major regression
F16 KV, FA off, batch 1024, ubatch 256 126,720 Context creation failed
F16 KV, FA on, 2048/256, GGML_CUDA_P2P=1 32,768 105.76 10.68 37.38 129.40 No measurable gain
F16 KV, FA on, 2048/256, GGML_CUDA_P2P=1 65,536 66.00 6.18 22.63 79.35 No measurable gain
F16 KV, FA on, 2048/256, launch queues 4× 32,768 105.53 10.69 37.36 129.34 No measurable gain
F16 KV, FA on, 2048/256, launch queues 4× 65,536 66.03 6.18 22.63 79.34 No measurable gain
Tensor split Crashed / unsupported
Layer split, equal 1/1/1/1/1/1 Stable and selected

Here is where I ended up as far as optimal configuration is concerned:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
"$HOME/llama.cpp/build-cuda/bin/llama-server" \
-m "$HOME/.lmstudio/models/unsloth/MiniMax-M2.7-GGUF/MiniMax-M2.7-UD-Q3_K_XL-00001-of-00004.gguf" \
-dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5 \
-ngl 999 \
--fit off \
--split-mode layer \
--tensor-split 1,1,1,1,1,1 \
--ctx-size 131072 \
--parallel 1 \
--cache-type-k f16 \
--cache-type-v f16 \
--batch-size 2048 \
--ubatch-size 256 \
--flash-attn on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--n-predict 8192 \
--host 0.0.0.0 \
--port 8080 \
--timeout 30000

submitted by /u/Old_Grapefruit8774
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA