r/LocalLLaMA · July 2, 2026 · 2 min read

6x P40 running Minimax M2.7_Q3_XL

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I've been a lurker for a while and have been building my own home lab with P40's and MI50's. I've learned so much from the community and I just felt like it's time to give back. Even though I'm still learning I'm sure this information will be valuable to someone out there. I'll be posting MI50's details once I'm done fine tuning my P40 box.

Hardware:

Asus X99-E-WS (Modded BIOS to support a large number GPU's )
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
128GB DDR4 RAM (mixed batch of Non-ECC sticks)
SSD
6x P40's 144GB VRAM (Gen3 x8,x8,x8,x8,x8,x8)

Memory distribution during benchmark

The below table shows benchmarks I ran with my findings:

Test configuration	Context	pp512	tg128	pp512+tg128	pp4096+tg128	Result
F16 KV, FA on, batch 2048, ubatch 512	32,768	73.20	10.45	33.50	129.51	Original baseline
F16 KV, FA on, batch 2048, ubatch 512	65,536	42.68	6.43	19.49	77.22	Original baseline
F16 KV, FA on, batch 2048, ubatch 512	126,720	24.16	3.51	10.90	44.22	Fits
Q8 KV, FA on, batch 2048, ubatch 512	65,536	42.53	6.14	—	—	Slower than F16
Q8 KV, FA on, batch 2048, ubatch 512	126,720	23.91	3.06	—	—	Generation −12.8%
F16 KV, FA on, batch 1024, ubatch 256	32,768	105.76	10.70	37.34	128.94	Strong improvement
F16 KV, FA on, batch 1024, ubatch 256	65,536	66.00	6.18	22.63	79.39	Strong improvement
F16 KV, FA on, batch 2048, ubatch 256	32,768	105.91	10.50	37.41	129.42	Selected
F16 KV, FA on, batch 2048, ubatch 256	65,536	65.86	6.38	22.63	79.37	Selected
F16 KV, FA off, batch 1024, ubatch 256	32,768	34.16	2.72	—	—	Major regression
F16 KV, FA off, batch 1024, ubatch 256	65,536	19.34	1.50	—	—	Major regression
F16 KV, FA off, batch 1024, ubatch 256	126,720	—	—	—	—	Context creation failed
F16 KV, FA on, 2048/256, `GGML_CUDA_P2P=1`	32,768	105.76	10.68	37.38	129.40	No measurable gain
F16 KV, FA on, 2048/256, `GGML_CUDA_P2P=1`	65,536	66.00	6.18	22.63	79.35	No measurable gain
F16 KV, FA on, 2048/256, launch queues 4×	32,768	105.53	10.69	37.36	129.34	No measurable gain
F16 KV, FA on, 2048/256, launch queues 4×	65,536	66.03	6.18	22.63	79.34	No measurable gain
Tensor split	—	—	—	—	—	Crashed / unsupported
Layer split, equal `1/1/1/1/1/1`	—	—	—	—	—	Stable and selected

Here is where I ended up as far as optimal configuration is concerned:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
"$HOME/llama.cpp/build-cuda/bin/llama-server" \
-m "$HOME/.lmstudio/models/unsloth/MiniMax-M2.7-GGUF/MiniMax-M2.7-UD-Q3_K_XL-00001-of-00004.gguf" \
-dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5 \
-ngl 999 \
--fit off \
--split-mode layer \
--tensor-split 1,1,1,1,1,1 \
--ctx-size 131072 \
--parallel 1 \
--cache-type-k f16 \
--cache-type-v f16 \
--batch-size 2048 \
--ubatch-size 256 \
--flash-attn on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--n-predict 8192 \
--host 0.0.0.0 \
--port 8080 \
--timeout 30000

submitted by /u/Old_Grapefruit8774
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA