Tip: use this llama.cpp PR to improve PP on Intel ARC
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
https://github.com/ggml-org/llama.cpp/pull/25222
Another win for Intel ARC users (all 4 of us). The community keeps improving llama.cpp for Intel ARC. This time, the hero behind that Pull Request (with the help of Claude) improved prompt processing speed considerably. For comparison, I have a B580 and a 116k-context conversation. Processing everything from scratch used to take 510 seconds at 245 t/s; now it takes 262 seconds at 462 t/s, so nearly twice as fast.

Model: Qwen3.6 35B A3B Q5_K_XL. Command:

    ./llama-server --host 0.0.0.0 --port 8080 \
      --model /models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
      --jinja --threads 8 --ctx-size 262144 --cache-ram 0 --parallel 1 \
      --temperature 0.0 --top-p 0.2 --top-k 20 --no-mmap \
      --spec-type draft-mtp --spec-draft-n-max 3 \
      --batch-size 2700 --ubatch-size 2700 \
      --n-gpu-layers 99 --n-cpu-moe 99

The only catch is that the speedup only applies to an F16 KV cache for now, but the contributor said he will work on other quants later.
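As a quick sanity check on the before/after figures quoted above, here is a minimal Python sketch that computes the speedup two ways: from wall-clock time and from the reported tokens per second. The four numbers come from the post; the variable and function names are only illustrative. The implied token counts (seconds times tokens per second) land a bit above the 116k figure, which I assume comes down to rounding or to how the rate is reported.

```python
# Figures quoted in the post (B580, ~116k-token conversation, processed from scratch).
old_seconds, old_tps = 510, 245   # before the PR
new_seconds, new_tps = 262, 462   # after the PR


def speedup(worse: float, better: float) -> float:
    """Return how many times 'better' improves on 'worse' (caller orders the arguments)."""
    return worse / better


# For time, the larger value is the worse one; for throughput, the smaller value is.
time_speedup = speedup(old_seconds, new_seconds)
rate_speedup = new_tps / old_tps

# Rough cross-check: tokens implied by each (seconds, tokens/s) pair.
implied_old = old_seconds * old_tps
implied_new = new_seconds * new_tps

print(f"wall-clock speedup: {time_speedup:.2f}x")
print(f"throughput speedup: {rate_speedup:.2f}x")
print(f"implied prompt size: {implied_old} vs {implied_new} tokens")
```

Both ratios agree on a speedup of a bit under 2x, so the time and throughput figures are consistent with each other.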
You see, Intel's hardware is capable of great things, and each contribution from the community and from Intel brings us closer to the full speed of the hardware.