Description
Great work everyone on llama.cpp! I am Alan Gray, a developer technology engineer from NVIDIA, and I have developed an optimization that allows the CUDA kernels associated with the generation of each token to be launched and executed as a single CUDA Graph, which gives around a 10-15% overall speedup on our GPUs for the Llama 2 7B cases I have tested so far. More details are below. Could someone please add me to the project (username `agray3`) so I can push the branch? It will require a bit more testing and tweaking before it is ready for a PR.
For an introduction to CUDA Graphs, see the blog I wrote a few years ago: https://developer.nvidia.com/blog/cuda-graphs/
In llama.cpp, I use the stream capture functionality introduced in the blog post, which allows the patch to be very non-intrusive: it is isolated within `ggml_backend_cuda_graph_compute` in `ggml-cuda.cu` (except for a utility function to get a function pointer from `ggml-cuda/cpy.cu`).
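To make the mechanism concrete, here is a minimal, standalone sketch of the stream-capture pattern the patch relies on: launch the existing kernels on a stream while capture is active, instantiate the captured graph, and replay the whole sequence with a single launch call. The kernel and sizes below are placeholders for illustration, not the actual llama.cpp kernels, and the CUDA 12 form of `cudaGraphInstantiate` is assumed:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the existing sequence of kernel launches into a graph
    // instead of executing each launch immediately.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 8; ++i) {   // stand-in for a token's kernel sequence
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0f, n);
    }
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then launch all captured kernels with one API call.
    cudaGraphExec_t instance;
    cudaGraphInstantiate(&instance, graph, 0);   // CUDA 12 signature; CUDA 11 takes extra error/log arguments
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaFree(d_x);
    cudaStreamDestroy(stream);
    printf("graph replayed\n");
    return 0;
}
```

This compiles as an ordinary CUDA program (e.g. `nvcc graph_sketch.cu -o graph_sketch`).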
For example, for `llama-2-7b.Q4_K_M` inference on H100-PCIe (with `--n-gpu-layers 100 -n 128`), performance goes from 143.35 to 163.83 tokens per second (a 14% speedup).
Here are some screenshots from Nsight Systems which show why using CUDA graphs is beneficial.
Here is the execution of a token using the current llama.cpp:
Each CUDA kernel is launched and executed separately. The highlighted entries show the launch API call associated with a specific kernel.
Zoomed in:
The main problem is the gaps between the kernels. (Note that in this case these gaps are mostly due to GPU-side launch overheads rather than CPU-side API calls.)
With this patch, the whole token generation is launched as a single CUDA graph. Zoomed in:
The use of CUDA graphs has allowed the kernels to be much more tightly packed.
The execution of the kernels themselves is around 40% faster with CUDA graphs. The overall speedup is lower (14%), largely due to overheads associated with creating and launching the graph, but there is scope to reduce these overheads further in the future.
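For what it's worth, one standard way to cut the per-token graph-creation cost is to keep the instantiated executable graph across tokens and patch it in place with `cudaGraphExecUpdate`, only re-instantiating when the update fails (for example because the topology changed). A rough sketch of that pattern, assuming CUDA 12 signatures and using placeholder kernels rather than the real llama.cpp ones:

```cpp
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Stand-in for the per-token kernel sequence (placeholder, not the real llama.cpp kernels).
static void launch_token_kernels(cudaStream_t stream, float *d_x, int n) {
    for (int k = 0; k < 8; ++k) {
        dummy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0f, n);
    }
}

// Capture this token's kernels, then reuse the previously instantiated
// executable graph by updating it in place; only re-instantiate if the
// update fails (e.g. the graph topology changed).
static void run_token(cudaStream_t stream, cudaGraphExec_t &instance, float *d_x, int n) {
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launch_token_kernels(stream, d_x, n);
    cudaStreamEndCapture(stream, &graph);

    if (instance == nullptr) {
        cudaGraphInstantiate(&instance, graph, 0);   // first token: pay full instantiation cost once
    } else {
        cudaGraphExecUpdateResultInfo info;          // CUDA 12; CUDA 11 uses a 4-argument variant
        if (cudaGraphExecUpdate(instance, graph, &info) != cudaSuccess) {
            cudaGetLastError();                      // clear the error from the failed update
            cudaGraphExecDestroy(instance);
            cudaGraphInstantiate(&instance, graph, 0);
        }
    }
    cudaGraphDestroy(graph);

    cudaGraphLaunch(instance, stream);               // one launch call for the whole token
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraphExec_t instance = nullptr;
    for (int t = 0; t < 4; ++t) {                    // simulate a few tokens
        run_token(stream, instance, d_x, n);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(instance);
    cudaFree(d_x);
    cudaStreamDestroy(stream);
    return 0;
}
```

Whether this pays off in practice depends on how often the captured topology actually changes between tokens.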