r/LocalLLaMA · June 27, 2026 · 1 min read

Another big tensor fix b9820

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

sched : reintroduce less synchronizations during split compute (#20793)

CUDA: Improve performance via less synchronizations between token (#17795)
Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()
Adds function to relax sync requirements between input copies on supported backends (CUDA for now)
Exchanges synchronous copy with async copy function.
Adds macro guards to allow compilation in non-CUDA builds
Reworked backend detection in ggml-backend.cpp to avoid linking conflicts
Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
Minor cleanup
Makes the opt-in to relax use of explicit syncs more general. Backends like Vulkan, which require a synchronization between HtoD copies and graph execution, could also adopt this change now.
Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization
Simplifies synchronizations to adhere to saaasg pattern.
Apply suggestion from @ggerganov (src->buffer to buf_src)
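
To make the idea behind the list above concrete, here is a minimal toy sketch in plain C++ of why batching input copies saves synchronizations. It is not the ggml or CUDA code from the PR: MockStream, copy_async, upload_strict and upload_relaxed are hypothetical names invented for illustration, and reading "saaasg" as "sync, async copies, sync, graph" is an assumption based on the wording of the commit message. The mock stream only runs queued work when the host synchronizes, which is enough to count how often the host has to block.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for an in-order device stream (hypothetical, not a ggml or CUDA type).
struct MockStream {
    std::vector<std::function<void()>> pending; // work queued but not yet completed
    int sync_count = 0;                         // number of blocking host-side syncs

    // Queue work and return immediately, like an async copy would.
    void enqueue(std::function<void()> task) { pending.push_back(std::move(task)); }

    // Block until all queued work has run, in submission order.
    void synchronize() {
        for (auto & task : pending) task();
        pending.clear();
        ++sync_count;
    }
};

using Buffer = std::vector<float>;

// Queue a host-to-device style copy of src into dst on the stream.
inline void copy_async(MockStream & s, const Buffer & src, Buffer & dst) {
    s.enqueue([&src, &dst] { dst = src; });
}

// Strict mode: every input copy is followed by its own sync.
inline int upload_strict(MockStream & s, const std::vector<Buffer> & srcs,
                         std::vector<Buffer> & dsts) {
    dsts.resize(srcs.size());
    for (std::size_t i = 0; i < srcs.size(); ++i) {
        copy_async(s, srcs[i], dsts[i]);
        s.synchronize();
    }
    return s.sync_count;
}

// Relaxed mode (assumed reading of "saaasg"): one sync before the copies,
// all copies queued asynchronously, one sync after, then the graph would run.
inline int upload_relaxed(MockStream & s, const std::vector<Buffer> & srcs,
                          std::vector<Buffer> & dsts) {
    dsts.resize(srcs.size());
    s.synchronize(); // earlier work that may still read dsts has finished
    for (std::size_t i = 0; i < srcs.size(); ++i) {
        copy_async(s, srcs[i], dsts[i]);
    }
    s.synchronize(); // all inputs are in place before graph execution
    return s.sync_count;
}
```

In this toy model the strict path blocks once per input, while the relaxed path blocks twice regardless of how many inputs a split has, and both leave identical data in the destination buffers. Real backends differ in which syncs they can actually drop, which is presumably why the change is opt-in per backend.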
