Cheapest way to run GLM 5.x locally that's not a unified memory system?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is primarily an exercise to map out the possible options, however obscure, for running at least a 4-bit quant (roughly IQ4_XS).
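For sizing, a rough back-of-the-envelope sketch may help: the weight footprint is approximately parameter count times bits per weight divided by 8. The 4.25 bits-per-weight figure is the nominal value usually quoted for IQ4_XS; real GGUF files mix tensor types and will differ somewhat. The parameter count below is a hypothetical placeholder, not GLM 5.x's actual size, and `overhead_gb` is an assumed allowance for KV cache and runtime buffers.

```python
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, vram_gb: float = 0.0, overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed overhead allowance fit in RAM + VRAM."""
    needed = quant_size_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= ram_gb + vram_gb


IQ4_XS_BPW = 4.25                # nominal figure; actual files vary
HYPOTHETICAL_PARAMS_B = 350.0    # placeholder, NOT a confirmed model size

if __name__ == "__main__":
    size = quant_size_gb(HYPOTHETICAL_PARAMS_B, IQ4_XS_BPW)
    print(f"weights: ~{size:.0f} GB")
    # 128 GB RAM + 20 GB VRAM, matching the machine described in this post
    print("fits in 128 GB + 20 GB:", fits(HYPOTHETICAL_PARAMS_B, IQ4_XS_BPW, 128, 20))
```

Swapping in the real parameter count and the file size reported on the model page gives a quick first filter before looking at any hardware.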
Got a CPU-only setup? Please share your experience. A Sapphire Rapids ES 56-core + DDR5 might be an option.
Multi-GPU setups with partial or complete offloading? What's your performance like?
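For comparing CPU-only candidates, a crude upper bound on decode speed can be sketched from memory bandwidth alone, since token generation mostly streams the active weights once per token. The sketch below assumes 8 memory channels at DDR5-4800 for a Sapphire Rapids board (an ES chip or a given motherboard may run fewer channels or lower speeds) and uses a hypothetical active-parameter count for a mixture-of-experts model. Measured throughput will land well below this bound.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (64-bit channel = 8 bytes)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def decode_tps_upper_bound(bandwidth_gbs: float, active_params_billion: float,
                           bits_per_weight: float) -> float:
    """Tokens/s ceiling if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_read_per_token


if __name__ == "__main__":
    # Assumption: 8 channels of DDR5-4800, all populated
    bw = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=4800)
    # Assumption: 30B active parameters per token (hypothetical MoE figure)
    tps = decode_tps_upper_bound(bw, active_params_billion=30.0, bits_per_weight=4.25)
    print(f"peak bandwidth: {bw:.1f} GB/s, decode ceiling: {tps:.1f} tok/s")
```

The same two functions work for a partial-offload estimate by treating the GPU-resident and CPU-resident shares of the active weights separately, each with its own bandwidth.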
It's not limited to GLM 5.x; anything similarly sized is fine for the scope of this discussion.
Personally, I'm running a 5900X + 128GB DDR4 + 7900XT 20GB. The largest model I can run is MiniMax M2.7 from AesSedai at Q4_K_S - https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF
For smaller stuff, it's still Qwen 3.6 27B at IQ4_XS from Unsloth/Bartowski.