How to improve RAM offload?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I have only 12GB of VRAM (RTX 3060) but enough RAM to run Qwen3.6 27B Q4 with offload. I don't expect maximum performance, but why is DRAM bandwidth only around 30GB/s (per HWiNFO) during inference with dual-channel 5200 RAM? TG is 3.12 tok/sec with an 18K-token result. I expected it to be slow, but I can't figure out where the bottleneck is: is it how LM Studio works, or do I need a better CPU (I have a 7500F)? Of course dual 3090s would do the job, but this is what I have for now. I also tried a smaller prompt with 6 CPU threads, Q8 KV cache, and GPU offload set to 37, and got TG 4.95 tok/sec with bandwidth at 30-35GB/s.
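For reference, here is a rough back-of-the-envelope sketch of the numbers above. It assumes the RAM is DDR5-5200 with a 64-bit (8-byte) bus per channel, and it assumes all of the DRAM traffic HWiNFO reports is weight reads for token generation. Both function names are made up for illustration. Under those assumptions it gives the theoretical peak bandwidth and how many GB would be streamed from RAM per generated token at the measured rates. If the CPU layers and GPU layers run one after another for each token, DRAM would sit idle while the GPU works, which might be one reason the averaged reading stays well below peak.

```python
def peak_dram_bandwidth_gbs(transfers_mt_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second * bytes per transfer * channels."""
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


def implied_gb_per_token(bandwidth_gbs: float, tokens_per_s: float) -> float:
    """GB read from RAM per generated token, assuming all measured traffic is weight reads."""
    return bandwidth_gbs / tokens_per_s


if __name__ == "__main__":
    # Figures taken from the post; DDR5-5200 dual channel is an assumption.
    peak = peak_dram_bandwidth_gbs(5200)
    print(f"theoretical peak: {peak:.1f} GB/s")
    print(f"measured share of peak at 30 GB/s: {30 / peak:.0%}")
    print(f"18K run:      {implied_gb_per_token(30, 3.12):.1f} GB per token")
    print(f"smaller run:  {implied_gb_per_token(30, 4.95):.1f} to "
          f"{implied_gb_per_token(35, 4.95):.1f} GB per token")
```

If the per-token figure is close to the size of the weights actually left in system RAM, then the measured bandwidth and the token rate agree with each other, and the thing to look at is how much of each token's time is spent in the CPU layers rather than the HWiNFO average itself.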