r/LocalLLaMA

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

I came across this interesting article: https://blog.exolabs.net/nvidia-dgx-spark/. I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my LLM setup?

A Mac can host large models, but its prefill speed is poor, so I tested the idea on my setup with Kimi K2.7 Code.

Short answer: it helps prefill, but it does not meaningfully help decode on this setup. RPC is still mostly a capacity tool unless the network/interconnect and split mode are much better.

Setup

  • Host: Mac Studio M3 Ultra, 512GB unified memory, Metal
  • Worker: Linux box with NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96GB VRAM, CUDA
  • Network: direct Ethernet between Mac and Linux box, but only 1GbE in practice
  • Measured RPC transfer rate: about 112-113 MiB/s
  • Model: unsloth/Kimi-K2.7-Code-GGUF, UD-Q3_K_XL
  • Model size on disk: about 432GB across 11 GGUF shards
  • Runtime: llama.cpp server version 9827 (4c6e0ff3a), Unsloth build

Controlled test

Same synthetic prompt for both runs:

  • Prompt tokens: 7120
  • Generated tokens: 64
  • temperature: 0
  • ignore_eos: true
  • Prompt cache disabled

Result, Mac + RTX (20,80 split) vs. Mac-only:

  • Prefill gain: about 14.8%
  • Decode gain: about 4.2%
  • Total request time improvement: about 12.3%
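
A minimal sketch of how settings like these could be sent to the llama.cpp server's /completion endpoint and how the returned timings could be read back. The server URL, the prompt passed in, and the helper names are assumptions for illustration; this is not the script used for the numbers in this post.

```python
import json
import urllib.request

# Assumption: llama-server listening on its default local address.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_payload(prompt: str) -> dict:
    """Request body mirroring the controlled-test settings listed above."""
    return {
        "prompt": prompt,
        "n_predict": 64,        # generated tokens
        "temperature": 0,
        "ignore_eos": True,     # keep generating until n_predict is reached
        "cache_prompt": False,  # prompt cache disabled
    }


def run_once(prompt: str, url: str = SERVER_URL) -> dict:
    """POST one request and return the server's `timings` object."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["timings"]


def summarize(timings: dict) -> str:
    """Format prefill and decode speed from a `timings` dict."""
    return (
        f"prefill {timings['prompt_n']} tok @ "
        f"{timings['prompt_per_second']:.2f} tok/s, "
        f"decode {timings['predicted_n']} tok @ "
        f"{timings['predicted_per_second']:.2f} tok/s"
    )
```

Calling `summarize(run_once(prompt))` once against the Mac-only server and once against the RPC-enabled server, with the same prompt, would give directly comparable prefill and decode figures.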

Split trend

The generation columns (Decode, Total) show "-" for runs where I only ran prefill. The controlled generation rows used the exact same 7120-token synthetic prompt; the earlier split-sweep rows were around 7.1K prompt tokens but not always the exact same prompt.

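Run       | RTX share | Split | Prompt (s) | Prefill (tok/s) | Decode (tok/s) | Total (s) | RTX VRAM
----------|-----------|-------|------------|-----------------|----------------|-----------|---------
Mac       | 0%        | -     | 53.58      | 132.88          | 17.55          | 57.23     | none
Mac + RTX | 15%       | 15,85 | 51.48      | 138.3           | -              | -         | 69.4GB
Mac + RTX | 19%       | 19,81 | 50.22      | 141.77          | -              | -         | 84.1GB
Mac + RTX | 20%       | 20,80 | 49.54      | 143.72          | -              | -         | 93.2GB
Mac + RTX | 20%       | 20,80 | 46.69      | 152.49          | 18.28          | 50.19     | 93.3GB
Mac + RTX | 21%       | 21,79 | -          | failed          | -              | -         | failed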
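
The percentage gains quoted for the controlled test can be re-derived from the two rows that include generation (Mac-only and the 20,80 run). A small sketch of that arithmetic; the function and variable names are made up for illustration:

```python
def pct_gain_rate(baseline: float, new: float) -> float:
    """Percent increase of a throughput figure (higher is better)."""
    return (new / baseline - 1.0) * 100.0


def pct_time_saved(baseline_s: float, new_s: float) -> float:
    """Percent reduction of a wall-clock time (lower is better)."""
    return (1.0 - new_s / baseline_s) * 100.0


# Values copied from the Mac-only row and the 20,80 row with generation.
mac = {"prefill_tps": 132.88, "decode_tps": 17.55, "total_s": 57.23}
rpc = {"prefill_tps": 152.49, "decode_tps": 18.28, "total_s": 50.19}

prefill_gain = pct_gain_rate(mac["prefill_tps"], rpc["prefill_tps"])
decode_gain = pct_gain_rate(mac["decode_tps"], rpc["decode_tps"])
total_saved = pct_time_saved(mac["total_s"], rpc["total_s"])

print(f"prefill gain: {prefill_gain:.1f}%")   # 14.8%
print(f"decode gain:  {decode_gain:.1f}%")    # 4.2%
print(f"total saved:  {total_saved:.1f}%")    # 12.3%
```

Note that the two throughput gains are relative increases in tok/s, while the total figure is a relative reduction in seconds, so they are not directly additive.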

20,80 was the practical max on this card with 128K context.

21,79 failed even at 8K context:

RPC/network trace

For the 7120-token prefill-only 20,80 run:

  • Mac -> RTX: 251.59 MiB, 2.03s
  • RTX -> Mac: 194.69 MiB, 1.49s
  • Total RPC traffic: 446.28 MiB, 3.52s
  • RTX graph compute: 1.34s
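
To put the trace in more familiar units, the figures above can be turned into an effective transfer rate per direction and an average amount of RPC traffic per prompt token. This is only a sketch over the numbers listed; the names are illustrative, and it assumes the reported times cover just the transfers.

```python
PROMPT_TOKENS = 7120

# (MiB transferred, seconds) per direction, copied from the trace above.
trace = {
    "mac_to_rtx": (251.59, 2.03),
    "rtx_to_mac": (194.69, 1.49),
}


def mib_per_s(mib: float, seconds: float) -> float:
    """Effective transfer rate for one direction."""
    return mib / seconds


total_mib = sum(mib for mib, _ in trace.values())
total_s = sum(sec for _, sec in trace.values())
kib_per_token = total_mib * 1024 / PROMPT_TOKENS

for name, (mib, sec) in trace.items():
    print(f"{name}: {mib_per_s(mib, sec):.1f} MiB/s")
print(f"total: {total_mib:.2f} MiB in {total_s:.2f}s")
print(f"average RPC traffic per prompt token: {kib_per_token:.1f} KiB")
```

The per-token figure is an average over the whole prefill, both directions combined; it says nothing about how the traffic is distributed across chunks.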

The RPC traffic is mostly hidden activations, not text tokens. For prefill it is chunked/batched, so the network cost is noticeable but not fatal. For decode, the boundary is crossed every generated token, which is why I expected decode to suffer more. In this test decode was roughly the same as Mac-only: 18.28 tok/s vs 17.55 tok/s.
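
Converting decode throughput into time per token makes the comparison easier to read, since every per-token boundary crossing has to fit inside that budget. A small sketch using only the two decode figures above; the helper name is made up for illustration:

```python
def ms_per_token(tok_per_s: float) -> float:
    """Average wall-clock milliseconds spent per generated token."""
    return 1000.0 / tok_per_s


mac_only_ms = ms_per_token(17.55)  # Mac-only decode
mac_rtx_ms = ms_per_token(18.28)   # Mac + RTX (20,80) decode

print(f"Mac-only:  {mac_only_ms:.2f} ms/token")
print(f"Mac + RTX: {mac_rtx_ms:.2f} ms/token")
print(f"difference: {mac_only_ms - mac_rtx_ms:.2f} ms/token")
```

With these numbers the difference is only a couple of milliseconds per token, and over 64 generated tokens that is small enough that it may partly be run-to-run noise.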

Learnings

  • I could probably knock off a few more seconds with a better cable (the link currently runs at only 1GbE), but I'm not sure it's worth it.
  • It is useful for fitting models/splits that otherwise do not fit one device.

Question: As I increased the share of the model placed on the RTX, prefill time kept decreasing. Will this trend continue if I add one more GPU? People with multi-GPU setups, what's your take on this?

submitted by /u/No_Run8812