DeepSeek-V4-Flash (MXFP4): compute buffer scales ~3x just from KV cache quant type (f16 vs q8_0) — anyone else seeing this? Llama.cpp
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Bartowski's DeepSeek-V4-Flash-MXFP4 GGUF, llama.cpp build 9851 (0eca4d490), deepseek4 arch.
Two runs with identical settings (n_ctx = 10240, n_batch = n_ubatch = 8192, flash attention on); the only difference is -ctk/-ctv:
| Cache type | Total KV cache (CUDA0) | CUDA0 compute buffer |
|---|---|---|
| f16 (default, no -ctk/-ctv set) | ~425 MiB | 12,964 MiB |
| q8_0 (-ctk q8_0 -ctv q8_0) | ~226 MiB | 3,973 MiB |
So switching the KV cache quant type only saves ~200MB of actual cache (expected — DSV4's compressed CSA/HCA/lightning-indexer caches are tiny either way), but it shaves ~9GB off the compute buffer — a 3.26x difference — with literally nothing else changed.
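For anyone who wants to sanity-check the arithmetic, here is a minimal sketch that recomputes the savings and the ratio from the two rows in the table. The dictionary layout and variable names are just for illustration; the only real inputs are the four MiB figures above.

```python
# Figures copied from the table above (MiB, CUDA0 only).
MIB_PER_GIB = 1024

runs = {
    "f16": {"kv_mib": 425, "compute_mib": 12964},
    "q8_0": {"kv_mib": 226, "compute_mib": 3973},
}

# Absolute savings from switching f16 -> q8_0.
kv_saved = runs["f16"]["kv_mib"] - runs["q8_0"]["kv_mib"]
compute_saved = runs["f16"]["compute_mib"] - runs["q8_0"]["compute_mib"]

# Relative size of the f16 compute buffer versus the q8_0 one.
ratio = runs["f16"]["compute_mib"] / runs["q8_0"]["compute_mib"]

print(f"KV cache saved:       {kv_saved} MiB")
print(f"Compute buffer saved: {compute_saved} MiB ({compute_saved / MIB_PER_GIB:.2f} GiB)")
print(f"Compute buffer ratio: {ratio:.2f}x")
```

This gives 199 MiB saved on the cache itself, 8,991 MiB (about 8.78 GiB) saved on the compute buffer, and a ratio of 3.26x.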
This is what was actually causing my OOM at higher context (35.9GB compute buffer requested at ctx=32000 with f16 cache, on a 32GB card). Once I forced q8_0 cache, it loads fine.
Does forcing -ctk q8_0 -ctv q8_0 cut your compute buffer by a similar ~3x?
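If you want to compare your own runs, a rough sketch like the one below can pull the compute buffer size out of two saved llama.cpp logs and print the ratio. It assumes the loader prints a line of the form `<device> compute buffer size = <N> MiB`; the exact prefix and spacing may differ between builds, so treat the regex as a starting point. The function names and the two sample log lines are made up for illustration (the sample lines just reuse the numbers from the table, they are not pasted from a real log).

```python
import re

# ASSUMPTION: llama.cpp logs a line shaped like
#   "llama_context:      CUDA0 compute buffer size = 12964.00 MiB"
# Prefix and spacing can vary by build; adjust the pattern if needed.
BUFFER_RE = re.compile(r"(\S+)\s+compute buffer size\s*=\s*([0-9.]+)\s*MiB")


def compute_buffers(log_text: str) -> dict[str, float]:
    """Return {device: MiB} for every 'compute buffer size' line in a log."""
    return {dev: float(mib) for dev, mib in BUFFER_RE.findall(log_text)}


def buffer_ratio(log_a: str, log_b: str, device: str = "CUDA0") -> float:
    """Ratio of the compute buffer in log_a to the one in log_b for a device."""
    return compute_buffers(log_a)[device] / compute_buffers(log_b)[device]


# Illustrative sample lines built from the table's numbers (hypothetical logs).
f16_log = "llama_context:      CUDA0 compute buffer size = 12964.00 MiB\n"
q8_log = "llama_context:      CUDA0 compute buffer size =  3973.00 MiB\n"

print(f"{buffer_ratio(f16_log, q8_log):.2f}x")
```

In practice you would read the two strings from your own log files (one run with the default f16 cache, one with -ctk q8_0 -ctv q8_0) instead of the hard-coded samples.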