r/LocalLLaMA · July 2, 2026

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode


I came across this interesting article: https://blog.exolabs.net/nvidia-dgx-spark/. I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my setup for LLMs?

A Mac can host large models, but the prefill speed sucks, so I tested the idea on my setup with Kimi K2.7.

Short answer: it helps prefill, but it does not meaningfully help decode on this setup. RPC is still mostly a capacity tool unless the network/interconnect and split mode are much better.

Setup

Host: Mac Studio M3 Ultra, 512GB unified memory, Metal
Worker: Linux box with NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96GB VRAM, CUDA
Network: direct Ethernet between Mac and Linux box, but only 1GbE in practice
Measured RPC transfer rate: about 112-113 MiB/s
Model: unsloth/Kimi-K2.7-Code-GGUF, UD-Q3_K_XL
Model size on disk: about 432GB across 11 GGUF shards
Runtime: llama.cpp server version 9827 (4c6e0ff3a), Unsloth build

Controlled test

Same synthetic prompt for both runs:

Prompt tokens: 7120
Generated tokens: 64
temperature: 0
ignore_eos: true
Prompt cache disabled

Results, Mac + RTX (20,80) vs Mac-only:

Prefill gain: about 14.8%
Decode gain: about 4.2%
Total request time improvement: about 12.3%
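
For anyone who wants to check the percentages, here is a minimal sketch that recomputes them from the two controlled rows in the table further down (Mac-only vs Mac + RTX at 20,80). The variable and function names are made up for illustration; only the numbers come from the post.

```python
# Controlled-run numbers copied from the results table in this post.
mac_only = {"prefill_tps": 132.88, "decode_tps": 17.55, "total_s": 57.23}
mac_rtx = {"prefill_tps": 152.49, "decode_tps": 18.28, "total_s": 50.19}


def pct_gain(new: float, old: float) -> float:
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def pct_time_saved(new_s: float, old_s: float) -> float:
    """Relative reduction in wall-clock time, in percent."""
    return (old_s - new_s) / old_s * 100.0


prefill_gain = pct_gain(mac_rtx["prefill_tps"], mac_only["prefill_tps"])
decode_gain = pct_gain(mac_rtx["decode_tps"], mac_only["decode_tps"])
total_saved = pct_time_saved(mac_rtx["total_s"], mac_only["total_s"])

print(f"Prefill gain: {prefill_gain:.1f}%")
print(f"Decode gain: {decode_gain:.1f}%")
print(f"Total request time improvement: {total_saved:.1f}%")
```

Note that the prefill and decode figures are throughput gains (tok/s), while the total figure is a reduction in wall-clock time, so the three are not directly additive.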

Split trend

A - in the Decode and Total columns means I only ran prefill for that row. The controlled generation rows used the exact same 7120-token synthetic prompt; the earlier split-sweep rows were around 7.1K prompt tokens but not always the exact same prompt.

Run	RTX share	Split	Prompt sec	Prefill tok/s	Decode	Total	RTX VRAM
Mac	0%	-	53.58	132.88	17.55 tok/s	57.23s	none
Mac + RTX	15%	15,85	51.48	138.3	-	-	69.4GB
Mac + RTX	19%	19,81	50.22	141.77	-	-	84.1GB
Mac + RTX	20%	20,80	49.54	143.72	-	-	93.2GB
Mac + RTX	20%	20,80	46.69	152.49	18.28 tok/s	50.19s	93.3GB
Mac + RTX	21%	21,79	-	failed	-	-	failed
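
To make the trend in the table easier to read, here is a rough sketch that computes each row's prefill speedup over the Mac-only baseline, plus the prompt length implied by prompt seconds times tok/s. The row labels and the helper name are illustrative assumptions; the numbers are copied from the table.

```python
BASELINE_TPS = 132.88  # Mac-only prefill tok/s from the table

# (split label, RTX share in %, prompt seconds, prefill tok/s)
rows = [
    ("15,85", 15, 51.48, 138.30),
    ("19,81", 19, 50.22, 141.77),
    ("20,80 (prefill only)", 20, 49.54, 143.72),
    ("20,80 (controlled)", 20, 46.69, 152.49),
]


def summarize(rows, baseline_tps):
    """Return (label, share, speedup %, implied prompt tokens) per row."""
    out = []
    for label, share, secs, tps in rows:
        speedup_pct = (tps / baseline_tps - 1.0) * 100.0
        implied_tokens = secs * tps  # should land near the ~7.1K prompt size
        out.append((label, share, round(speedup_pct, 1), round(implied_tokens)))
    return out


summary = summarize(rows, BASELINE_TPS)
for label, share, speedup_pct, tokens in summary:
    print(f"{label:22s} RTX {share:2d}%  +{speedup_pct:4.1f}% prefill  ~{tokens} tokens")
```

The two 20,80 rows differ by several percentage points of speedup, which is about as large as the steps between the sweep rows, so the sweep trend is probably best read with that run-to-run spread in mind.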

20,80 was the practical max on this card with 128K context.

21,79 failed even at 8K context.

RPC/network trace

For the 7120-token prefill-only 20,80 run:

Mac -> RTX: 251.59 MiB, 2.03s
RTX -> Mac: 194.69 MiB, 1.49s
Total RPC traffic: 446.28 MiB, 3.52s
RTX graph compute: 1.34s

The RPC traffic is mostly hidden activations, not text tokens. For prefill it is chunked/batched, so the network cost is noticeable but not fatal. For decode, the boundary is crossed every generated token, which is why I expected decode to suffer more. In this test decode was roughly the same as Mac-only: 18.28 tok/s vs 17.55 tok/s.
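
The trace numbers above can be turned into a quick back-of-the-envelope check of how much of the prefill is spent on the wire. This is only a sketch: it assumes the trace belongs to the 49.54s prefill-only 20,80 row in the table, and the variable names are made up for illustration.

```python
PROMPT_TOKENS = 7120
PROMPT_SECONDS = 49.54  # assumption: the prefill-only 20,80 row from the table

# (MiB transferred, seconds) in each direction, copied from the trace
mac_to_rtx = (251.59, 2.03)
rtx_to_mac = (194.69, 1.49)
RTX_COMPUTE_S = 1.34

total_mib = mac_to_rtx[0] + rtx_to_mac[0]
total_rpc_s = mac_to_rtx[1] + rtx_to_mac[1]

# Average activation traffic per prompt token, both directions combined
kib_per_token = total_mib * 1024 / PROMPT_TOKENS

# Share of the whole prefill spent in RPC transfer and in RTX graph compute
rpc_share_pct = total_rpc_s / PROMPT_SECONDS * 100.0
rtx_compute_share_pct = RTX_COMPUTE_S / PROMPT_SECONDS * 100.0

print(f"Total RPC traffic: {total_mib:.2f} MiB in {total_rpc_s:.2f}s")
print(f"Traffic per prompt token: {kib_per_token:.1f} KiB")
print(f"RPC transfer share of prefill: {rpc_share_pct:.1f}%")
print(f"RTX compute share of prefill: {rtx_compute_share_pct:.1f}%")
```

If the transfers are not overlapped with compute (an assumption; the trace alone does not show this), even an infinitely fast link could save at most about 3.5s of the roughly 49.5s prefill.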

Learnings

I could knock off a few more seconds by using a better cable, but I'm not sure it's worth it.
RPC is useful for fitting models/splits that otherwise do not fit on one device.

Question: As I increased the RTX share of the split, prefill time kept going down. Will this trend continue if I add one more GPU? People with multi-GPU setups, what's your take on this?

submitted by /u/No_Run8812