7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
OS: CachyOS
Instructions:
Connect your monitor directly to the iGPU so that when you boot Linux the dGPU's VRAM is 100% free. By default, driving the display from the dGPU eats about 700 MB~1.2 GB of VRAM that could otherwise hold context. Yes, you can still game normally using this approach.
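To confirm the dGPU really is idle after moving the monitor, something like the sketch below may help. It assumes the amdgpu kernel driver, which exposes `mem_info_vram_used` and `mem_info_vram_total` under sysfs; the function name `vram_report` is just an illustrative helper, not something from the post.

```shell
# Print used/total VRAM for every GPU that exposes amdgpu's sysfs counters.
# ASSUMPTION: the amdgpu kernel driver is loaded; other drivers lack these files.
vram_report() (
  root="${1:-/sys/class/drm}"
  for dev in "$root"/card*/device; do
    # Skip connector entries and GPUs that do not expose the counters.
    [ -r "$dev/mem_info_vram_used" ] && [ -r "$dev/mem_info_vram_total" ] || continue
    used=$(cat "$dev/mem_info_vram_used")    # bytes
    total=$(cat "$dev/mem_info_vram_total")  # bytes
    card=$(basename "$(dirname "$dev")")
    printf '%s: %d MiB used of %d MiB\n' "$card" \
      "$((used / 1048576))" "$((total / 1048576))"
  done
)

vram_report
```

Run it before loading a model: the iGPU and dGPU each get their own line, and the 24 GB card should show close to zero usage if the display is really on the iGPU.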
Set up the KV cache at q5_0/q4_0 (make sure to compile with GGML_CUDA_FA_ALL_QUANTS).
Yes, Q5_0/Q4_0 is about 1.6% less precise than Q8, but in exchange it uses 12% less VRAM, as shown here (Qwen does an amazing job with a quantized KV cache):
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
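For a rough feel of where the saving comes from, here is a small sketch that compares how many bytes ggml stores per 32-element block for each cache type. The helper names are made up for illustration, and it assumes K and V hold the same number of elements. It only covers the cache itself, not total VRAM, so the result is not directly comparable to the 12% figure above.

```shell
# Bytes used by one 32-element block in ggml's storage layouts.
block_bytes() {
  case "$1" in
    f16)  echo 64 ;;  # 32 x 2-byte half floats
    q8_0) echo 34 ;;  # 2-byte scale + 32 x 8-bit values
    q5_0) echo 22 ;;  # 2-byte scale + 4 bytes of high bits + 16 bytes of low nibbles
    q4_0) echo 18 ;;  # 2-byte scale + 16 bytes of 4-bit values
    *) echo "unknown type: $1" >&2; return 1 ;;
  esac
}

# Size of a K/V type pair as a percentage of a q8_0/q8_0 cache (rounded down).
# ASSUMPTION: K and V contain the same number of elements.
kv_pct_of_q8() (
  k=$(block_bytes "$1") || exit 1
  v=$(block_bytes "$2") || exit 1
  base=$(( 2 * $(block_bytes q8_0) ))
  echo $(( (k + v) * 100 / base ))
)

echo "K=q4_0 V=q5_0: $(kv_pct_of_q8 q4_0 q5_0)% of q8_0/q8_0"
```

So under these assumptions the q4_0/q5_0 cache takes a bit under 60% of the space a q8_0/q8_0 cache would.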
Now I can run the Qwen 3.6 27B Unsloth Q6K model (~22 GB) with 131k context at 55~60 t/s.
Add these arguments when compiling (I got the BLAS changes from here, where someone said they helped him reduce VRAM usage, and well...):
-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA_FA_ALL_QUANTS=true
You can then just pass the llama.cpp arguments:
-ctv q5_0 -ctk q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0 -c 131000 --jinja --mlock --parallel 1 --no-mmproj
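Put together, the build and launch steps could look roughly like the sketch below. Several things in it are my assumptions, not from the post: the `-DGGML_HIP=ON` backend flag (the post does not say which GPU backend was used), the `llama-server` binary, the placeholder model path, and the `run` dry-run helper. It prints the commands by default so nothing is built or launched by accident.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# ASSUMPTION: placeholder path, point it at your own GGUF file.
MODEL="${MODEL:-/path/to/model.gguf}"

build_llama() {
  # The BLAS and FA_ALL_QUANTS flags are from the post.
  # ASSUMPTION: -DGGML_HIP=ON (ROCm/HIP backend) is not mentioned in the post.
  run cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS \
    -DGGML_CUDA_FA_ALL_QUANTS=true &&
    run cmake --build build --config Release
}

serve_model() {
  # ASSUMPTION: llama-server from the build above; the post does not name the binary.
  # --jinja is llama.cpp's chat-template flag.
  run build/bin/llama-server -m "$MODEL" \
    -c 131000 --parallel 1 --mlock --no-mmproj --jinja \
    -ctk q4_0 -ctv q5_0 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
    --presence-penalty 0.0 --repeat-penalty 1.0
}

build_llama && serve_model
```

Run it from the llama.cpp source directory with `DRY_RUN=0 MODEL=/your/model.gguf` once the printed commands look right for your setup.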