New optimization from NVIDIA to use CUDA Graphs in llama.cpp #6763

Closed
@agray3

Great work everyone on llama.cpp! I am Alan Gray, a developer technology engineer from NVIDIA, and have developed an optimization to allow the CUDA kernels associated with the generation of each token to be launched and executed as a single CUDA Graph, which is giving around 10-15% overall speedup on our GPUs for the llama2 7B cases I have tested so far. More details are below. Could someone please add me to the project (username agray3) and I'll push the branch? It will require a bit more testing and tweaking before it is ready for a PR.

For an introduction to CUDA Graphs, see the blog I wrote a few years ago: https://developer.nvidia.com/blog/cuda-graphs/
In llama.cpp, I use the stream capture functionality that is introduced in the blog, which allows the patch to be very non-intrusive - it is isolated within ggml_backend_cuda_graph_compute in ggml-cuda.cu (except a utility function to get a function pointer from ggml-cuda/cpy.cu).
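As a minimal sketch of the stream capture pattern described above (not the actual patch), the following records a sequence of kernel launches on a stream into a CUDA graph, instantiates it, and replays it with a single launch call each time. The kernel `add_one` and the helper `run_captured_graph` are hypothetical names invented for this illustration; only the `cuda*` runtime calls are real API. The five-argument form of `cudaGraphInstantiate` is assumed to be accepted by the installed toolkit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)

// Hypothetical stand-in for one of the many short kernels run per token.
__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Captures `kernels_per_graph` launches into one graph, replays the graph
// `replays` times, and returns true if every element of the buffer ends up
// equal to kernels_per_graph * replays.
bool run_captured_graph(int n, int kernels_per_graph, int replays) {
    float *d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // While the stream is capturing, launches are recorded, not executed.
    cudaGraph_t graph;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels_per_graph; ++k) {
        add_one<<<blocks, threads, 0, stream>>>(d_x, n);
    }
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    // Turn the recorded graph into an executable instance.
    cudaGraphExec_t exec;
    CUDA_CHECK(cudaGraphInstantiate(&exec, graph, NULL, NULL, 0));

    // Each replay submits all recorded kernels with one API call.
    for (int r = 0; r < replays; ++r) {
        CUDA_CHECK(cudaGraphLaunch(exec, stream));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));

    std::vector<float> h(n);
    CUDA_CHECK(cudaMemcpy(h.data(), d_x, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    const float expected = static_cast<float>(kernels_per_graph * replays);
    bool ok = true;
    for (float v : h) ok = ok && (v == expected);

    CUDA_CHECK(cudaGraphExecDestroy(exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(d_x));
    return ok;
}

int main() {
    bool ok = run_captured_graph(1 << 16, 20, 3);
    std::printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

Because capture wraps existing launch code rather than replacing it, the kernels themselves need no changes, which is what makes this approach easy to confine to a single function.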

For example, for llama-2-7b.Q4_K_M inference on H100-PCIe (with --n-gpu-layers 100 -n 128), performance goes from 143.35 to 163.83 tokens per second (a 14% speedup).

Here are some screenshots from Nsight Systems which show why using CUDA Graphs is of benefit.

Here is the execution of a token using the current llama.cpp:
[Screenshot: nograph]

Each CUDA kernel is launched and executed separately. The highlighted entries show the launch API call associated with a specific kernel.

Zoomed in:
[Screenshot: nograph_zoom]
The main problem is the gaps between the kernels. (Note that in this case the gaps are mostly due to GPU-side launch overheads rather than CPU API calls.)

With CUDA Graphs:
[Screenshot: graph]

The whole token generation is launched by a single CUDA graph. Zoomed in:
[Screenshot: graph_zoom]
The use of CUDA graphs has allowed the kernels to be much more tightly packed.

The execution of the graph itself is around 40% faster with CUDA Graphs. The overall speedup is lower (14%), largely due to overheads associated with creating and launching the graph, but there is scope to further reduce these overheads in the future.
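To make the distinction between execution time and graph-creation overhead concrete, here is a rough timing sketch. It is an illustration under stated assumptions, not how the numbers above were measured: `tiny_kernel`, `Timings`, and `measure_launch_overheads` are hypothetical names, and host-side wall-clock timing is used for simplicity. It times (a) launching a batch of short kernels one by one, (b) capturing and instantiating the same batch as a graph, and (c) replaying the instantiated graph.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)

// Hypothetical short kernel, so that launch overhead dominates run time.
__global__ void tiny_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

using Clock = std::chrono::steady_clock;

static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

struct Timings {
    double direct_ms;  // average per batch, kernels launched one by one
    double build_ms;   // one-off capture + instantiate cost
    double replay_ms;  // average per batch, replaying the graph
};

Timings measure_launch_overheads(int n, int kernels, int iters) {
    float *d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(float)));
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not timed.
    tiny_kernel<<<blocks, threads, 0, stream>>>(d_x, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaStreamSynchronize(stream));

    Timings t{};

    // (a) Individual launches.
    auto t0 = Clock::now();
    for (int it = 0; it < iters; ++it) {
        for (int k = 0; k < kernels; ++k)
            tiny_kernel<<<blocks, threads, 0, stream>>>(d_x, n);
        CUDA_CHECK(cudaStreamSynchronize(stream));
    }
    CUDA_CHECK(cudaGetLastError());
    t.direct_ms = ms_since(t0) / iters;

    // (b) Capture and instantiate the same batch as a graph.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    t0 = Clock::now();
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels; ++k)
        tiny_kernel<<<blocks, threads, 0, stream>>>(d_x, n);
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));
    CUDA_CHECK(cudaGraphInstantiate(&exec, graph, NULL, NULL, 0));
    t.build_ms = ms_since(t0);

    // (c) Replay the graph, after one untimed warm-up replay.
    CUDA_CHECK(cudaGraphLaunch(exec, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));
    t0 = Clock::now();
    for (int it = 0; it < iters; ++it) {
        CUDA_CHECK(cudaGraphLaunch(exec, stream));
        CUDA_CHECK(cudaStreamSynchronize(stream));
    }
    t.replay_ms = ms_since(t0) / iters;

    CUDA_CHECK(cudaGraphExecDestroy(exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(d_x));
    return t;
}

int main() {
    Timings t = measure_launch_overheads(1 << 12, 200, 50);
    std::printf("direct: %.3f ms/batch\n", t.direct_ms);
    std::printf("build:  %.3f ms (one-off)\n", t.build_ms);
    std::printf("replay: %.3f ms/batch\n", t.replay_ms);
    return 0;
}
```

If the graph had to be rebuilt for every batch, the effective per-batch cost would be roughly `build_ms + replay_ms` rather than `replay_ms` alone, which is one way an end-to-end speedup can fall short of the execution-only speedup.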
