Description
Great work everyone on llama.cpp! I am Alan Gray, a developer technology engineer from NVIDIA, and I have developed an optimization that allows the CUDA kernels associated with the generation of each token to be launched and executed as a single CUDA Graph, which gives around a 10-15% overall speedup on our GPUs for the Llama 2 7B cases I have tested so far. More details are below. Could someone please add me to the project (username `agray3`) so I can push the branch? It will require a bit more testing and tweaking before it is ready for a PR.
For an introduction to CUDA Graphs, see the blog I wrote a few years ago: https://developer.nvidia.com/blog/cuda-graphs/
In llama.cpp, I use the stream capture functionality introduced in the blog post, which allows the patch to be very non-intrusive: it is isolated within `ggml_backend_cuda_graph_compute` in `ggml-cuda.cu` (except for a utility function to get a function pointer from `ggml-cuda/cpy.cu`).
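To make the mechanism concrete, here is a minimal, standalone sketch of the stream-capture pattern the patch relies on: launch the existing kernels on a stream while capture is active, instantiate the captured graph, and replay the whole sequence with a single launch call. The kernel and sizes below are placeholders for illustration, not the actual llama.cpp kernels, and the CUDA 12 form of `cudaGraphInstantiate` is assumed:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the existing sequence of kernel launches into a graph
    // instead of executing each launch immediately.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 8; ++i) {   // stand-in for a token's kernel sequence
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0f, n);
    }
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then launch all captured kernels with one API call.
    cudaGraphExec_t instance;
    cudaGraphInstantiate(&instance, graph, 0);   // CUDA 12 signature; CUDA 11 takes extra error/log arguments
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaFree(d_x);
    cudaStreamDestroy(stream);
    printf("graph replayed\n");
    return 0;
}
```

This compiles as an ordinary CUDA program (e.g. `nvcc graph_sketch.cu -o graph_sketch`).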
For example, for `llama-2-7b.Q4_K_M` inference on H100-PCIe (with `--n-gpu-layers 100 -n 128`), performance goes from 143.35 to 163.83 tokens per second (a 14% speedup).
Here are some screenshots from Nsight Systems which show why using CUDA graphs is beneficial.
Here is the execution of a token using the current llama.cpp:
Each CUDA kernel is launched and executed separately. The highlighted entries show the launch API call associated with a specific kernel.
Zoomed in:
The main problem is the gaps between the kernels. (Note that in this case these gaps are mostly due to GPU-side launch overheads rather than CPU-side API calls.)
With this patch, the whole token generation is launched as a single CUDA graph. Zoomed in:
The use of CUDA graphs has allowed the kernels to be much more tightly packed.
The execution of the kernels themselves is around 40% faster with CUDA graphs. The overall speedup is lower (14%), largely due to overheads associated with creating and launching the graph, but there is scope to reduce these overheads further in the future.
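For what it's worth, one standard way to cut the per-token graph-creation cost is to keep the instantiated executable graph across tokens and patch it in place with `cudaGraphExecUpdate`, only re-instantiating when the update fails (for example because the topology changed). A rough sketch of that pattern, assuming CUDA 12 signatures and using placeholder kernels rather than the real llama.cpp ones:

```cpp
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Stand-in for the per-token kernel sequence (placeholder, not the real llama.cpp kernels).
static void launch_token_kernels(cudaStream_t stream, float *d_x, int n) {
    for (int k = 0; k < 8; ++k) {
        dummy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0f, n);
    }
}

// Capture this token's kernels, then reuse the previously instantiated
// executable graph by updating it in place; only re-instantiate if the
// update fails (e.g. the graph topology changed).
static void run_token(cudaStream_t stream, cudaGraphExec_t &instance, float *d_x, int n) {
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launch_token_kernels(stream, d_x, n);
    cudaStreamEndCapture(stream, &graph);

    if (instance == nullptr) {
        cudaGraphInstantiate(&instance, graph, 0);   // first token: pay full instantiation cost once
    } else {
        cudaGraphExecUpdateResultInfo info;          // CUDA 12; CUDA 11 uses a 4-argument variant
        if (cudaGraphExecUpdate(instance, graph, &info) != cudaSuccess) {
            cudaGetLastError();                      // clear the error from the failed update
            cudaGraphExecDestroy(instance);
            cudaGraphInstantiate(&instance, graph, 0);
        }
    }
    cudaGraphDestroy(graph);

    cudaGraphLaunch(instance, stream);               // one launch call for the whole token
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraphExec_t instance = nullptr;
    for (int t = 0; t < 4; ++t) {                    // simulate a few tokens
        run_token(stream, instance, d_x, n);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(instance);
    cudaFree(d_x);
    cudaStreamDestroy(stream);
    return 0;
}
```

Whether this pays off in practice depends on how often the captured topology actually changes between tokens.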