GLM 5.2, what speeds are we getting locally?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Can everyone who is able to run GLM 5.2 locally report their inference engine, system specs, quantization, context size, and tokens/sec? If you're getting great numbers, expect follow-up questions. I'll start:
llama.cpp, 6x RTX 3090, 128 GB DDR5, i7-13700K, unsloth UD-IQ2_M, 90K context with Q8_0 KV cache: 7.8 tokens/sec generation, roughly 40 tokens/sec prompt processing
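For anyone comparing reports, a quick back-of-the-envelope sketch of what those rates mean in wall-clock time. It assumes "90K" means 90,000 tokens and that both rates stay constant over the whole run, which is a simplification (real rates usually vary with how full the context is). The 1,000-token reply length and the helper name `seconds_for` are made up for illustration.

```python
def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to process `tokens` at a steady rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tokens / tokens_per_sec


# Rates taken from the report above.
PROMPT_TPS = 40.0   # prompt processing, tokens/sec
GEN_TPS = 7.8       # generation, tokens/sec

# Assumption: "90K context" read as 90,000 tokens, fully filled.
CONTEXT_TOKENS = 90_000
# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 1_000

prefill_s = seconds_for(CONTEXT_TOKENS, PROMPT_TPS)
reply_s = seconds_for(REPLY_TOKENS, GEN_TPS)

print(f"Prefill of a full context: {prefill_s / 60:.1f} min")
print(f"{REPLY_TOKENS}-token reply: {reply_s / 60:.1f} min")
```

Swap in your own rates and context size to put different setups on the same footing.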