Follow-up: GLM-5.2 NVFP4 on four DGX Sparks — the MTP mystery is solved, and it's now ~24 tok/s at 128K context
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is a follow-up to my earlier post about running GLM-5.2 NVFP4 on 4x DGX Spark at 128K context. Short version of that post: 128K worked at ~15 tok/s with MTP1, and there was a painful tradeoff where you could have 128K context OR ~23 tok/s (DCP1 at 32K), but not both. I also flagged that MTP2/MTP3 acceptance collapse at DCP4 "really looks buggy" but that 30 hours of digging hadn't cracked it.
It was buggy. It's cracked. Tradeoff gone. Here's how it shook out:
TL;DR
| | old post (DCP4/128K/MTP1) | now (DCP4/128K/MTP3) | now (DCP4/128K/MTP4) |
|---|---|---|---|
| decode, short codegen (hot) | 14.5-15.2 tok/s | 22-23 tok/s | |
| MTP acceptance per position | 0.74 (MTP1 only) | 0.90 / 0.79 / 0.67 | |
| context | 131,072 | 131,072 | |
| hardware | 4x GB10 Spark + MikroTik RoCE | unchanged | |
Yes, MTP4 — the recursively-reused single MTP layer is still conditionally accepting at ~0.84 by position 4, which mirrors what I see on my RTX 6000 Pro box where MTP4 is also the peak. One config gotcha: MAX_CUDAGRAPH_CAPTURE_SIZE needs headroom above num_speculative_tokens + 1 (the draft derives a smaller cap than the target; exactly N+1 fails startup with "No valid cudagraph sizes"). I run 10 for MTP4. I've seen occasional runs sag when host paging churns — MTP3 is my conservative default, MTP4 the peak config.
Same machines, same switch, same checkpoint, same 1.81 GB/rank KV budget. The entire gain is one missing line of configuration plumbing in vLLM, plus rebasing onto a newer upstream branch. The DCP1/32K compromise config is now pointless: DCP4 at full context beats it outright.
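To relate the acceptance figures in the TL;DR to decode throughput, here is a rough back-of-the-envelope sketch. It assumes the table's per-position numbers are unconditional rates (position k counts as accepted only if every earlier position was too); the helper names are mine, not anything from vLLM.

```python
# Per-position acceptance as listed in the TL;DR table.
# Assumption: these are unconditional rates, i.e. the fraction of steps in
# which draft position k is accepted along with every earlier position.
mtp1 = [0.74]
mtp3 = [0.90, 0.79, 0.67]


def expected_tokens_per_step(acceptance):
    """The target always yields one token; each accepted draft adds one."""
    return 1.0 + sum(acceptance)


def conditional_rates(acceptance):
    """P(position k accepted | position k-1 accepted)."""
    prev = 1.0
    rates = []
    for a in acceptance:
        rates.append(a / prev)
        prev = a
    return rates


print(round(expected_tokens_per_step(mtp1), 2))       # 1.74
print(round(expected_tokens_per_step(mtp3), 2))       # 3.36
print([round(r, 2) for r in conditional_rates(mtp3)])  # [0.9, 0.88, 0.85]
```

Under that assumption the conditional rate at position 3 is about 0.85, which lines up with the ~0.84 conditional figure quoted for position 4. The tokens-per-step ratio (about 1.93x) is only an upper bound on the speedup; the measured decode gain in the table is closer to 1.5x, presumably because each extra draft step also costs time.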
What the bug actually was
In my original post I wrote that acceptance looked like 0.9, 0.75^4, 0.6^4 and guessed at some rank-intersection effect. The exponent intuition was pointing at something real (the damage does scale with DCP world size), but the mechanism was better-hidden than that — and the reason it survived 30+ hours of ablations is genuinely evil:
SpeculativeConfig.create_draft_parallel_config() builds the draft model's parallel config by copying fields from the target config — and decode_context_parallel_size is not one of the fields it copies. It silently defaults to 1. On the code path my stack uses, that value is consumed verbatim.
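The failure pattern is easy to reproduce in miniature. The sketch below is a toy, not vLLM's code: `ToyParallelConfig` and both factory functions are hypothetical stand-ins, and the only thing borrowed from the real stack is the field name `decode_context_parallel_size`.

```python
from dataclasses import dataclass


@dataclass
class ToyParallelConfig:
    """Hypothetical stand-in for a parallel config; not vLLM's class."""
    tensor_parallel_size: int = 1
    decode_context_parallel_size: int = 1


def create_draft_config_buggy(target: ToyParallelConfig) -> ToyParallelConfig:
    # Copies TP but forgets DCP, so DCP silently falls back to its default of 1.
    return ToyParallelConfig(tensor_parallel_size=target.tensor_parallel_size)


def create_draft_config_fixed(target: ToyParallelConfig) -> ToyParallelConfig:
    # Same copy, with the missing field carried over.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
        decode_context_parallel_size=target.decode_context_parallel_size,
    )


target = ToyParallelConfig(tensor_parallel_size=4, decode_context_parallel_size=4)
buggy = create_draft_config_buggy(target)
fixed = create_draft_config_fixed(target)

print(buggy.decode_context_parallel_size)  # 1
print(fixed.decode_context_parallel_size)  # 4
```

Nothing raises and nothing warns: the buggy config is perfectly valid, it just describes a different topology than the one the KV cache was written under.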
So under TP4/DCP4, the MTP draft layer's KV cache, metadata, and sparse-indexer state were all DCP-sharded (the writer side runs under the target config), while the draft's attention thought it wasn't under DCP at all: no query all-gather, no LSE merge, and the global top-k indices were consumed as if the local quarter-cache were the whole cache. Tensor dumps showed draft forwards where three of four ranks selected nothing and emitted literal all-zero attention for their 48 of 64 heads.
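For readers who haven't looked at context-parallel attention, here is a small NumPy sketch of why the LSE merge matters. It is a dense single-query, single-head toy under my own assumptions (round-robin token sharding, no sparse top-k, no query all-gather), not the MLA kernel: each rank attends over its own quarter of the cache, and the per-rank outputs only add up to the right answer once they are reweighted by their log-sum-exp.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, world = 22, 8, 4
q = rng.standard_normal(dim)
k = rng.standard_normal((seq_len, dim))
v = rng.standard_normal((seq_len, dim))


def attend(q, k, v):
    """Softmax attention for one query; returns (output, log-sum-exp)."""
    scores = k @ q / np.sqrt(q.shape[0])
    m = scores.max()
    w = np.exp(scores - m)
    lse = m + np.log(w.sum())
    return (w / w.sum()) @ v, lse


# Reference: attention over the whole cache on one device.
full_out, _ = attend(q, k, v)

# Sharded: rank r holds tokens r, r + world, r + 2 * world, ...
partials = [attend(q, k[r::world], v[r::world]) for r in range(world)]
outs = np.stack([o for o, _ in partials])
lses = np.array([l for _, l in partials])

# LSE merge: weight each rank's output by its share of the softmax mass.
m = lses.max()
global_lse = m + np.log(np.exp(lses - m).sum())
merged = np.exp(lses - global_lse) @ outs

print(np.allclose(merged, full_out))   # merge recovers full attention
print(np.allclose(outs[0], full_out))  # a lone shard treated as the whole cache does not
```

Skipping the merge is the toy analogue of the draft believing DCP is off: each rank's local result is a well-formed attention output, just over the wrong set of tokens.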
Here's the evil part: the very next op after attention is o_proj, which is row-parallel — its TP all-reduce sums the four inconsistent per-rank results into one hidden state that is bit-identical on every rank. Every cross-rank divergence check I ran in the original investigation came back clean, because the corruption is laundered into consensus one op after it happens. And because the draft gets the target's hidden state as input, single-step MTP1 mostly survives on that signal (~0.75 acceptance), while the recursive steps 2-3 compound the garbage and die. That's the collapse curve from my first post.
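The laundering effect can be shown with a few lines of NumPy. This is a sketch under made-up shapes (4 ranks, 16 heads per rank, tiny head and hidden sizes) with the all-reduce simulated by a plain sum; it is not the real o_proj.

```python
import numpy as np

rng = np.random.default_rng(1)
world, heads_per_rank, head_dim, hidden = 4, 16, 4, 32
width = heads_per_rank * head_dim

# Row-parallel o_proj: each rank holds the weight rows for its own heads.
w_shards = [rng.standard_normal((width, hidden)) for _ in range(world)]

# Healthy per-rank attention outputs, then a corrupted copy in which
# ranks 1-3 emit zeros (as in the dumps: 48 of 64 heads all-zero).
healthy = [rng.standard_normal(width) for _ in range(world)]
corrupt = [a.copy() for a in healthy]
for r in (1, 2, 3):
    corrupt[r][:] = 0.0


def o_proj_all_reduce(attn_per_rank):
    """Each rank projects its heads; the all-reduce sums the partials."""
    partial = [attn_per_rank[r] @ w_shards[r] for r in range(world)]
    total = np.sum(partial, axis=0)
    # Every rank receives the same summed tensor.
    return [total.copy() for _ in range(world)]


good = o_proj_all_reduce(healthy)
bad = o_proj_all_reduce(corrupt)

# A cross-rank divergence check passes on the corrupted run...
print(all(np.array_equal(bad[0], h) for h in bad))  # True
# ...even though the hidden state is wrong.
print(np.allclose(bad[0], good[0]))                 # False
```

Any check that compares ranks against each other after this point sees perfect agreement; only a comparison against an independent reference exposes the problem.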
It also explains why the bug shrugged off every knob: KV interleave size, ag_rs vs a2a DCP comm backend, global vs rank-local top-k, CUDA graphs vs eager — none of them touch how the draft's parallel config is constructed. I tested all of them (identical acceptance curves to two decimal places) before giving up on config space and building a tensor tap instead.
How it got found
Method notes, since I know some people like gory details:
- Rebased the stack onto a much newer upstream branch (see below). Capacity reproduced exactly; MTP3 still collapsed. That killed "it's fixed upstream" and "it's my old fork."
- Burned four more boots falsifying the remaining config hypotheses (interleave/comm-backend/top-k-mode/eager). All identical. At that point the bug had to be in the compute, not the config surface.
- Wrote a small env-gated tap into the MLA decode path that dumps, per draft-layer forward: the post-allgather query, the top-k indices actually consumed, per-rank partial output + LSE, the merged output, the metadata, and the raw fp8 KV pages.
- Calibrated the tap at DCP1: an fp64 reference attention over the dequantized fp8_ds_mla cache reproduced the kernel's outputs at cosine ≥ 0.9999 on every forward. So the instrument was trustworthy.
- Ran the same probe at DCP4 and read the dumps: impl.dcp_world_size == 1 on every rank, merged output byte-identical to the pre-merge partial (i.e., no merge ever ran), DCP-local sequence lengths (6/6/5/5 for a global 22) feeding a non-DCP attention, zero-output ranks. From there the config trace back to create_draft_parallel_config took about twenty minutes.
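The checks applied to the dumps are simple enough to sketch. The helpers below are hypothetical (they are not the tap itself) and assume round-robin token placement with an interleave of one, which is one layout that yields the 6/6/5/5 split for 22 tokens.

```python
import numpy as np


def dcp_local_seq_lens(global_len, world_size):
    """Tokens per rank when token i lives on rank i % world_size (assumed layout)."""
    return [len(range(r, global_len, world_size)) for r in range(world_size)]


def cosine(a, b):
    """Cosine similarity in fp64, for comparing kernel output to a reference."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def merge_ran(partial_out, merged_out):
    """A real cross-rank merge should change the bytes of the partial output."""
    return np.asarray(partial_out).tobytes() != np.asarray(merged_out).tobytes()


print(dcp_local_seq_lens(22, 4))  # [6, 6, 5, 5]

# Toy stand-ins for dumped tensors.
partial = np.array([0.25, -1.0, 0.5], dtype=np.float32)
print(merge_ran(partial, partial.copy()))       # False: byte-identical, no merge happened
print(cosine(partial, partial * 2.0) >= 0.9999)  # True: same direction passes the threshold
```

A local length of 6 reaching an attention that believes it sees the whole sequence, together with a "merged" tensor that is byte-identical to the partial, is the signature described in the last bullet.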
The fix is ~10 lines mirroring logic that upstream already has on their newer runner path (which is why big SM120 rigs never saw this — they run the code path that has the fix; my stack runs the one that doesn't). PR with the fix, three companion patches for GLM-routed checkpoints, and the full evidence is up as a draft:
https://github.com/local-inference-lab/vllm/pull/72
The updated recipe
Everything is in the same repo as before, same recipe directory:
- github.com/m9e/blackwell-llm-docker → recipes/4x-spark-cluster/glm52-b12x-spark/
- New production entry point: start-glm52-production.sh (DCP4 / MTP3 / 128K, diagnostics off)
- The image is now built from a much newer upstream base (local-inference-lab/vllm eldritch line, June 29 + b12x) with a 5-file overlay on top: the Spark Ray-startup fix and post-load malloc_trim from the original post, plus the DCP draft fixes. Build scripts in the recipe dir.
- ELDRITCH_REBASE_NOTES.md in the recipe dir has the whole investigation written up: every falsified hypothesis with numbers, the dump evidence, and the memory ledger.
- One embarrassing find worth flagging if you followed the original post: the NCCL channel narrowing (NCCL_MAX/MIN_NCHANNELS=4, pinned NCCL_IB_HCA) that I described as part of the memory win had never actually made it into the committed launch scripts; it was applied by hand during the original campaign. It's committed now. If you cloned the recipe before, you were running default channel counts and leaving memory on the table.
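If you launch from your own wrapper rather than the recipe scripts, the channel narrowing amounts to three environment variables. The sketch below is an assumption about how you might set them from Python; `nccl_env` is a hypothetical helper, the recipe's actual scripts may do this differently, and the HCA name is deliberately left as a required argument because the post doesn't say which device it is pinned to.

```python
import os


def nccl_env(ib_hca, channels=4, base=None):
    """Return a copy of the environment with NCCL channel counts narrowed.

    ib_hca: the HCA to pin NCCL to (value not given in the post; supply your own).
    channels: used for both NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS.
    base: environment to start from; defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_MIN_NCHANNELS"] = str(channels)
    env["NCCL_MAX_NCHANNELS"] = str(channels)
    env["NCCL_IB_HCA"] = ib_hca
    return env


# Placeholder HCA name, for illustration only.
env = nccl_env("YOUR_HCA_NAME", base={})
print(env["NCCL_MIN_NCHANNELS"], env["NCCL_MAX_NCHANNELS"])  # 4 4
```

The resulting dict can be passed as `env=` to `subprocess.run` when starting the server process.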
Everything else from the original post still applies: the aggressive OS/Ray pruning, host networking, fp8_ds_mla KV, the hybrid checkpoint assembly script (you still need the real model.layers.78.* MTP layer), and IB/RDMA enabled over the Spark fabric. The hardware section is unchanged down to the switch.
What I'd revise from the original post:
- "MTP2/MTP3 are research territory" → wrong, they were just broken. MTP3 is the production default now.
- "This setup has exactly one MTP layer, so MTP1 is the clean production point" → the one-layer recursion works fine once the draft can actually see the context it's drafting from. Position-3 acceptance is 0.67, which for a recursively-reused single-step head is honestly better than I expected.
- The 409/512 prefill oscillation from the original post: still there, still unexplained, still doesn't matter much.
Open threads
- A clean long-context decode measurement on the fixed stack (my first depth probe ran during host paging churn and isn't fair to report; the old MTP1 baseline was ~13 tok/s post-TTFT at 32K-112K, and acceptance doesn't decay much with depth, so I expect high teens — will follow up in comments with a clean number).
- A b12x-MoE-for-the-draft A/B and a DCP2 retest on the fixed stack, mostly for the config matrix's sake.
- The fp8_ds_mla quality question from the original post still deserves its own writeup.
One more point of reference, since expert-pruned GLM-5.2 checkpoints have been posting eye-catching Spark numbers lately: those runs get their headroom by dropping experts (e.g. 256 → 218 via a straight correction-bias ranking, with no recovery tuning at all) and/or running reduced context. Every number in this post is the full 256-expert checkpoint at 131,072 context. You don't have to prune this model to make four Sparks fast anymore.
If you have Sparks and were sitting on the 15 tok/s config: rebuild from the recipe, or wait for the PR to land upstream and rebuild from theirs. Four Sparks now run a 744B-class model at 128K context at ~24 tok/s, and the only thing that changed since last week is that the speculative decoder is no longer being fed a shredded view of its own cache.
Now, that's not exactly *blazing*: on an 8x RTX 6000 Pro box you can get a hair over 100 tok/s, and the folks cranking it on max-hardware setups like Together are clocking >300 tok/s. But I checked my nodes, and remember that we're almost certainly memory-bandwidth bound: this is frontier intelligence at 120 watts. Pretty awesome.
Oh, and one more tiny thing: h/t https://www.reddit.com/user/Front_Eagle739/, who reminded me of omlx. I tried it, and on an M3 Ultra it cut the wall time of a 112K-context run from over 6000 seconds to about 1000 seconds. It basically maintained 100+ tok/s prefill the entire time instead of completely collapsing to misery as context got long. It's still slow: 14-16 tok/s decode, versus the Sparks doing ~500 tok/s prefill (5x) and ~24 tok/s decode (+66%). But it was enough to promote the Mac to "usable" for the model, imo. (omlx also handles KV cache strongly, which my own harness also did.)