b9866

Mirrored from llama.cpp releases for archival readability. Support the source by reading on the original site.

cuda: enable topk-moe fusion for 288 experts (#25267)

  • cuda: enable topk-moe fusion for 288 experts

The topk-moe fusion only accepted power-of-2 expert counts (or the
special-cased 576), so models with 288 experts (e.g. Step-3.7-Flash)
fell back to the unfused per-layer routing chain: softmax/sigmoid,
argsort, get_rows, sum_rows, div, clamp, scale. At batch size 1 that
is ~330 extra tiny graph nodes per token.
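
For orientation, the routing chain listed above can be sketched as a plain CPU reference for a single token. This is a minimal illustration and not the ggml graph or the CUDA kernel: `Routing`, `route_token`, `scale`, and `min_sum` are hypothetical names, only the softmax gating variant is shown, and applying the clamp to the sum before the division is an assumption about ordering.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical result type: selected expert ids and their routing weights.
struct Routing {
    std::vector<int>   ids;
    std::vector<float> weights;
};

// CPU reference of the routing chain for one token (softmax gating).
// Each commented step stands for one op of the unfused graph.
// Requires 0 < top_k <= logits.size().
Routing route_token(const std::vector<float> & logits, int top_k,
                    float scale = 1.0f, float min_sum = 1e-20f) {
    const int n_expert = (int) logits.size();

    // softmax over all expert logits (max subtracted for stability)
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_expert);
    float denom = 0.0f;
    for (int i = 0; i < n_expert; ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        denom += probs[i];
    }
    for (float & p : probs) p /= denom;

    // argsort (descending); only the first top_k positions are needed
    std::vector<int> order(n_expert);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + top_k, order.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    Routing r;
    float sum = 0.0f;
    for (int k = 0; k < top_k; ++k) {
        r.ids.push_back(order[k]);             // selected expert id
        r.weights.push_back(probs[order[k]]);  // get_rows: gather its weight
        sum += probs[order[k]];                // sum_rows
    }
    sum = std::max(sum, min_sum);              // clamp (assumed: guards the divisor)
    for (float & w : r.weights) w = w / sum * scale;  // div, then scale
    return r;
}
```

In the unfused path each commented step is a separate small graph node per MoE layer, which is where the per-token node count comes from; the fused kernel collapses them into one launch.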

288 is a multiple of the warp size, so the existing kernel already
handles it; this adds the missing template instantiation and accepts
288 in the eligibility check.
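
A rough sketch of the two pieces described above, an eligibility check and a runtime-to-template dispatch, might look like the following. All names here (`topk_moe_fusable`, `experts_per_lane`, `dispatch_experts_per_lane`) are hypothetical, `MAX_POW2_EXPERTS` is an assumed bound, and the switch lists only a few illustrative cases; the real code in topk-moe.cu may be organized differently.

```cpp
// Assumed lane count per warp, following the commit's reasoning.
constexpr int WARP_SIZE = 32;

// Assumed upper bound for the power-of-2 path (illustrative only).
constexpr int MAX_POW2_EXPERTS = 512;

constexpr bool is_pow2(int n) { return n > 0 && (n & (n - 1)) == 0; }

// Hypothetical eligibility check: power-of-2 counts, plus the
// special-cased 576 and the newly accepted 288.
constexpr bool topk_moe_fusable(int n_expert) {
    if (is_pow2(n_expert) && n_expert <= MAX_POW2_EXPERTS) return true;
    return n_expert == 576 || n_expert == 288;
}

// Stand-in for a kernel templated on the expert count: each lane of a
// warp handles ceil(n_expert / WARP_SIZE) experts.
template <int n_expert>
constexpr int experts_per_lane() {
    return (n_expert + WARP_SIZE - 1) / WARP_SIZE;
}

// 288 divides evenly across the warp, so no lane is left with a partial share.
static_assert(288 % WARP_SIZE == 0, "288 is a multiple of the warp size");

// Runtime-to-compile-time dispatch: a count without a case has no
// instantiation, which is why an otherwise-compatible count can be rejected.
int dispatch_experts_per_lane(int n_expert) {
    switch (n_expert) {
        case 128: return experts_per_lane<128>();
        case 256: return experts_per_lane<256>();
        case 288: return experts_per_lane<288>();  // the newly added case
        case 576: return experts_per_lane<576>();
        default:  return -1;                       // not instantiated in this sketch
    }
}
```

The point of the sketch is that the check and the switch have to agree: accepting a count in the check without a matching instantiation (or the reverse) leaves the fused path unreachable for that count.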

Measured on gfx1151 with Step-3.7-Flash IQ4_XS (llama-bench,
-b 4096 -ub 4096 -fa 1 -dio 1 -ctk q8_0 -ctv q8_0). The machine was
idle and the before/after runs were paired, with pp4096 serving as a
load control since the change is not expected to affect it:

test            | before (t/s)   | after (t/s)
----------------+----------------+---------------------------
pp4096          | 460.99 ± 0.45  | 462.47 ± 0.34 (unchanged)
tg128           |  19.10 ± 0.04  |  19.56 ± 0.03 (+2.4%)
tg128 @ d30000  |  12.68 ± 0.04  |  12.69 ± 0.03 (unchanged)

Prompt processing is unaffected (the fusion only touches decode
routing). The decode gain is about 2.4% at shallow context and fades
with depth: by 30k tokens each step is attention-bound over the KV
cache, so removing the fixed routing overhead is no longer visible.

Assisted-By: Claude Fable 5 noreply@anthropic.com

  • Update tests/test-backend-ops.cpp

Co-authored-by: Oliver Simons osimons@nvidia.com

  • Add comment for case 288 in topk-moe.cu

Co-authored-by: Oliver Simons osimons@nvidia.com

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)
