Combined RTX5080 & 4060 for inference ?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Hey, I currently use my RTX 4060 8G for inference with Qwen 3.6-35B-A3B Q8 (q8 for everything weight,value,key) max 60k context per agent (for quality over speed, with CPU &DDR4 offloading) but :
In a few months I plan to add a 2nd GPU to the rig by moving the 4060 from it's current PCIE 5-x16 to PCIE 3-x16 and adding the new GPU on the PCIE 5-x16 slot. My budget for the upgrade (GPU + new powersupply) is in the 1500-2000€ but I'd be much more comfortable in the lower half of that range. TLDRI'm thinking of :
Questions: A. Does anyone use a comparable setup (very fast 16GB card + slower 8GB) and could tell me their stats with Qwen 27B specifying split type, MTP used or not, quants & context size please ? Its certain the bottleneck will be the 4060, but I'm uncertain how badly it will be. B. Even if you don't have one, do you think the proposed setup would work well for llama.cpp (or vllm) ? If not what would you recommend instead ? C. Even if your setup is not exactly comparable, but you have multiple GPUs, do you use llama.cpp or vllm : C.1. when using only one session at a time (no subagents) ? C.2. when hosting your own subagents (maybe only one running at a time still, but there's more KV to hold) ? D. On splitting weights between 2 cards there are 2 ways to do it, either layer or tensor. Layer is slower but does not depend on PCIE speed and tensor split can be quicker with good PCIE speed. Any tips and tricks from people having done this with some really asymmetrical GPUs ? E. For those that have 24GB VRAM total, what quantization of weights, key values do you use for QW3.6 27B and how much context do you manage to have with it ? F. For those that have R9700, are the real performance really that bad ? Only ~30% better pp & 50% better tg with R9700 than with my 300$ 4060 ? Or is it a pb with benchmarks being old (newer versions ROCM...) or performance being much better on recent models ? More details
(Also, AMD is apparently known for not working super well with VR so not really .
Benchmarks in question on older llama models :[link] [comments] |
More from r/LocalLLaMA
-
What's in your RAG?
Jul 2
-
Palantir CEO rages against closed models
Jul 2
-
A cheap trick for reliable structured output: feed the validation error back into the retry
Jul 2
-
SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.