forked from ggerganov/llama.cpp
docs : add performance troubleshoot + example benchmark documentation (ggerganov#1674)

* test anchor link
* test table
* add benchmarks
* Add performance troubleshoot & benchmark
* add benchmarks
* remove unneeded line

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 parent f146562, commit f4c55d3
Showing 3 changed files with 47 additions and 6 deletions.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# Token generation performance troubleshooting

## Verifying that the model is running on the GPU with cuBLAS
Make sure you compiled llama with the correct environment variables according to [this guide](../README.md#cublas), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. You may set `N` to a very large value; llama will then offload the maximum possible number of layers to the GPU, even if that is fewer than the number you configured. For example:
```shell
./main -m "path/to/model.bin" -ngl 200000 -p "Please sir, may I have some "
```

Before it starts the inference work, llama outputs diagnostic information that shows whether cuBLAS is offloading work to the GPU. Look for these lines:
```shell
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
... rest of inference
```

If you see these lines, then the GPU is being used.
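If you would rather not scan the output by eye, the check can be scripted. The following is a minimal sketch, assuming the diagnostic lines keep the format shown above; `gpu_offload_detected` and `run.log` are hypothetical names chosen for illustration, not part of llama.

```shell
# Hypothetical helper: reads log text on stdin and succeeds (exit 0) if it
# contains a "[cublas] offloading <N> layers to GPU" line like the one above.
gpu_offload_detected() {
    grep -q '\[cublas\] offloading [0-9][0-9]* layers to GPU'
}

# Assumed usage: capture both stdout and stderr so the diagnostics are
# included regardless of which stream they are written to, then check them.
#   ./main -m "path/to/model.bin" -ngl 200000 -n 16 -p "Please sir, may I have some " > run.log 2>&1
#   gpu_offload_detected < run.log && echo "GPU offload active" || echo "no offload lines found"
```

The pattern only matches the layer-count line, so it may need adjusting if the log format differs in your build.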

## Verifying that the CPU is not oversaturated
llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the value until token generation speed stops improving, then scale the number back down.
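The "start with 1 and double" procedure can be sketched as a small loop. This is an illustration only: `thread_candidates` is a hypothetical helper, and the physical core count is passed in by hand rather than detected.

```shell
# Hypothetical helper: print the thread counts to try, doubling from 1 and
# finishing with the physical core count given as $1.
# The ( ... ) body runs in a subshell, so t and max stay local to the function.
thread_candidates() (
    max="$1"
    t=1
    while [ "$t" -lt "$max" ]; do
        echo "$t"
        t=$((t * 2))
    done
    echo "$max"
)

# Assumed usage on a machine with 7 physical cores: run the same prompt once
# per candidate and compare token generation speed across the runs.
#   for t in $(thread_candidates 7); do
#       echo "== -t $t =="
#       ./main -m "path/to/model.bin" -t "$t" -n 128 -p "Please sir, may I have some "
#   done
```

For 7 physical cores the candidates are 1, 2, 4 and 7. Once the speed stops improving between two candidates, the values between them are worth trying as well.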

# Example benchmark: effect of runtime flags on inference speed
These runs were tested on the following machine:

- GPU: A6000 (48GB VRAM)
- CPU: 7 physical cores
- RAM: 32GB

Model: `TheBloke_Wizard-Vicuna-30B-Uncensored-GGML/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin` (30B parameters, 4-bit quantization, GGML)

Run command: `./main -m "path/to/model.bin" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 1000 [additional benchmark flags]`

Result:

| additional benchmark flags | tokens/second (higher is better) |
| - | - |
| `-ngl 2000000` | N/A (less than 0.1) |
| `-t 7` | 1.7 |
| `-t 1 -ngl 2000000` | 5.5 |
| `-t 7 -ngl 2000000` | 8.7 |
| `-t 4 -ngl 2000000` | 9.1 |
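
To put the table in relative terms, the speedup of each GPU run over the CPU-only `-t 7` run can be computed from the tokens/second column. The sketch below only does that arithmetic; `speedup` is a hypothetical helper and the numbers are copied from the table.

```shell
# Hypothetical helper: ratio of two tokens/second figures, one decimal place.
# $1 = measured speed, $2 = baseline speed.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Baseline: -t 7 without GPU offload (1.7 tokens/second).
speedup 5.5 1.7   # -t 1 -ngl 2000000
speedup 8.7 1.7   # -t 7 -ngl 2000000
speedup 9.1 1.7   # -t 4 -ngl 2000000
```

On this machine the fastest configuration uses fewer threads than there are physical cores, which is why it is worth measuring several `-t` values rather than assuming the core count is optimal.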