I've been a lurker for a while and have been building my own home lab with P40s and MI50s. I've learned so much from the community, and I felt it was time to give back. Even though I'm still learning, I'm sure this information will be valuable to someone out there. I'll post the MI50 details once I'm done fine-tuning my P40 box.

Hardware:

- Asus X99-E-WS (modded BIOS to support a large number of GPUs)
- Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- 128GB DDR4 RAM (mixed batch of non-ECC sticks)
- SSD
- 6x P40, 144GB VRAM total (Gen3 x8,x8,x8,x8,x8,x8)

[Image: memory distribution during benchmark]

The table below shows the benchmarks I ran, with my findings:

| Test configuration | Context | pp512 | tg128 | pp512+tg128 | pp4096+tg128 | Result |
|---|---|---|---|---|---|---|
| F16 KV, FA on, batch 2048, ubatch 512 | 32,768 | 73.20 | 10.45 | 33.50 | 129.51 | Original baseline |
| F16 KV, FA on, batch 2048, ubatch 512 | 65,536 | 42.68 | 6.43 | 19.49 | 77.22 | Original baseline |
| F16 KV, FA on, batch 2048, ubatch 512 | 126,720 | 24.16 | 3.51 | 10.90 | 44.22 | Fits |
| Q8 KV, FA on, batch 2048, ubatch 512 | 65,536 | 42.53 | 6.14 | — | — | Slower than F16 |
| Q8 KV, FA on, batch 2048, ubatch 512 | 126,720 | 23.91 | 3.06 | — | — | Generation −12.8% |
| F16 KV, FA on, batch 1024, ubatch 256 | 32,768 | 105.76 | 10.70 | 37.34 | 128.94 | Strong improvement |
| F16 KV, FA on, batch 1024, ubatch 256 | 65,536 | 66.00 | 6.18 | 22.63 | 79.39 | Strong improvement |
| F16 KV, FA on, batch 2048, ubatch 256 | 32,768 | 105.91 | 10.50 | 37.41 | 129.42 | Selected |
| F16 KV, FA on, batch 2048, ubatch 256 | 65,536 | 65.86 | 6.38 | 22.63 | 79.37 | Selected |
| F16 KV, FA off, batch 1024, ubatch 256 | 32,768 | 34.16 | 2.72 | — | — | Major regression |
| F16 KV, FA off, batch 1024, ubatch 256 | 65,536 | 19.34 | 1.50 | — | — | Major regression |
| F16 KV, FA off, batch 1024, ubatch 256 | 126,720 | — | — | — | — | Context creation failed |
| F16 KV, FA on, 2048/256, GGML_CUDA_P2P=1 | 32,768 | 105.76 | 10.68 | 37.38 | 129.40 | No measurable gain |
| F16 KV, FA on, 2048/256, GGML_CUDA_P2P=1 | 65,536 | 66.00 | 6.18 | 22.63 | 79.35 | No measurable gain |
| F16 KV, FA on, 2048/256, launch queues 4× | 32,768 | 105.53 | 10.69 | 37.36 | 129.34 | No measurable gain |
| F16 KV, FA on, 2048/256, launch queues 4× | 65,536 | 66.03 | 6.18 | 22.63 | 79.34 | No measurable gain |
| Tensor split | — | — | — | — | — | Crashed / unsupported |
| Layer split, equal 1/1/1/1/1/1 | — | — | — | — | — | Stable and selected |

Here is the configuration I ended up with as the best one I found:

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
    "$HOME/llama.cpp/build-cuda/bin/llama-server" \
      -m "$HOME/.lmstudio/models/unsloth/MiniMax-M2.7-GGUF/MiniMax-M2.7-UD-Q3_K_XL-00001-of-00004.gguf" \
      -dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5 \
      -ngl 999 \
      --fit off \
      --split-mode layer \
      --tensor-split 1,1,1,1,1,1 \
      --ctx-size 131072 \
      --parallel 1 \
      --cache-type-k f16 \
      --cache-type-v f16 \
      --batch-size 2048 \
      --ubatch-size 256 \
      --flash-attn on \
      --jinja \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 40 \
      --min-p 0.01 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0 \
      --n-predict 8192 \
      --host 0.0.0.0 \
      --port 8080 \
      --timeout 30000

submitted by /u/Old_Grapefruit8774
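For anyone who wants to put numbers on the "Result" column, here is a minimal Python sketch that computes the relative change between pairs of values copied from the benchmark table. The helper name `pct_change` and the `comparisons` dictionary are illustrative and not part of the original post; the pairs are only a selection, and each pair compares two values from the same column at the same context size.

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline, candidate) pairs copied from the benchmark table above.
# Labels are illustrative; both values in a pair share the same column.
comparisons = {
    "tg128 @ 126,720: Q8 KV vs F16 KV (2048/512)": (3.51, 3.06),
    "pp512 @ 32,768: ubatch 256 vs 512 (batch 2048)": (73.20, 105.91),
    "pp512 @ 65,536: ubatch 256 vs 512 (batch 2048)": (42.68, 65.86),
    "tg128 @ 32,768: FA off vs FA on (1024/256)": (10.70, 2.72),
}

for label, (baseline, candidate) in comparisons.items():
    print(f"{label}: {pct_change(baseline, candidate):+.1f}%")
```

The first pair reproduces the −12.8% generation drop listed for the Q8 KV cache at 126,720 context. The same helper can be pointed at any other two rows, as long as both values come from the same column and context size.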