Question: llama.cpp, what's good right now for MTP, KV cache quant, and long context?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I used the vLLM version of https://github.com/noonghunna/club-3090
It worked fine for maybe 20-40k context; I haven't tried the new one. Has anyone used the new patched llama.cpp one on a single 3090? The project is starting to seem very bloated, at least README-wise.
I use https://github.com/Indras-Mirror/llama.cpp-mtp and get 60 t/s with long context.
On mainline llama.cpp with q4 cache I get 60 t/s, but it quickly drops to 20 t/s as the context fills up.
Are there any better options, and what is your experience?
EDIT: Using Qwen 3.6 27B Q4
EDIT: I use MTP on mainline as described above; context is max 4k at good speed on Q4 cache.
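For anyone comparing cache types, here is a rough sketch of how KV cache memory scales with context length and cache quantization. It assumes every layer uses standard attention with a full KV cache, which may not hold for this model, and the layer/head numbers are hypothetical placeholders, not the real Qwen config. The bytes-per-element values follow the ggml block layouts (32 values plus a 2-byte scale per block). It only estimates memory, not speed.

```python
# Approximate bytes per cached element for each cache type.
# q8_0 and q4_0 store blocks of 32 values plus one 2-byte fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 one-byte values + 2-byte scale
    "q4_0": 18 / 32,  # 32 four-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate combined K and V cache size, assuming full attention in every layer."""
    elements = 2 * n_ctx * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape for illustration only (not taken from any model card).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        for n_ctx in (4096, 40960):
            size = kv_cache_bytes(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, cache_type)
            print(f"{cache_type:>5} ctx={n_ctx:>6}: {size / 2**30:.2f} GiB")
```

Swap in the actual values from the GGUF metadata to get numbers for a real model; the cache grows linearly with context, so 40k context costs ten times what 4k does at the same cache type.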