Support llama.cpp "Multi GPU support, CUDA refactor, CUDA scratch buffer" #344

Closed
wyhanz opened this issue Jun 8, 2023 · 8 comments
Labels: enhancement (New feature or request), llama.cpp (Problem with llama.cpp shared lib)

Comments

wyhanz commented Jun 8, 2023

Multi-GPU inference is essential for GPUs with small VRAM: a 13B LLaMA model cannot fit on a single 3090 without quantization.

llama.cpp merged its multi-GPU branch yesterday, which lets us deploy LLMs across small-VRAM GPUs:
ggerganov/llama.cpp#1703

Hope llama-cpp-python can support multi-GPU inference in the future.
Many thanks!

wyhanz changed the title from Supporting llama.cpp "Multi GPU support, CUDA refactor, CUDA scratch buffer" to Support llama.cpp "Multi GPU support, CUDA refactor, CUDA scratch buffer" Jun 8, 2023
gjmulder added the enhancement and llama.cpp labels Jun 8, 2023
abetlen (Owner) commented Jun 8, 2023

Pushed to v0.1.59, let me know if that works.

devilteo911 commented Jun 8, 2023

Can I use this with the high-level API, or is it available only in the low-level one?

gjmulder (Contributor) commented Jun 8, 2023

Working for me with the high-level API.

$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.59 --no-cache-dir
Collecting llama-cpp-python==0.1.59
  Downloading llama_cpp_python-0.1.59.tar.gz (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 69.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages (from llama-cpp-python==0.1.59) (4.6.0)
Requirement already satisfied: diskcache>=5.6.1 in /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages/diskcache-5.6.1-py3.10.egg (from llama-cpp-python==0.1.59) (5.6.1)
Requirement already satisfied: numpy>=1.20.0 in /home/mulderg/anaconda3/envs/lcp/lib/python3.10/site-packages (from llama-cpp-python==0.1.59) (1.24.3)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.1.59-cp310-cp310-linux_x86_64.whl size=310301 sha256=2182deb938214949a0838f7b4f649cd78636f396450a48686d2fe07dabcd1d81
  Stored in directory: /data/tmp/pip-ephem-wheel-cache-pps732gv/wheels/ae/3d/33/588d5327568faa38106293c3de8fc5ba0c3cf514d4f6eec9c5
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
  Attempting uninstall: llama-cpp-python
    Found existing installation: llama-cpp-python 0.1.57
    Uninstalling llama-cpp-python-0.1.57:
      Successfully uninstalled llama-cpp-python-0.1.57
Successfully installed llama-cpp-python-0.1.59

$ pip list | grep llama-cpp-python
llama-cpp-python         0.1.59

$ python ./smoke_test.py -f genesis50.txt -l 5 > /dev/null
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti
  Device 1: NVIDIA GeForce GTX 1080 Ti
llama.cpp: loading model from /data/llama/7B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 8192
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required  = 2292.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 12865 MB
...................................................................................................
llama_init_from_file: kv self size  = 4096.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =   594.61 ms
llama_print_timings:      sample time =  1273.07 ms /  2048 runs   (    0.62 ms per token)
llama_print_timings: prompt eval time =   594.55 ms /   109 tokens (    5.45 ms per token)
llama_print_timings:        eval time = 215376.56 ms /  2047 runs   (  105.22 ms per token)
llama_print_timings:       total time = 269666.63 ms
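
For reference, the high-level call exercised above might look roughly like this (a minimal sketch, not the actual smoke_test.py, which isn't reproduced in the thread; the prompt and context size are placeholders, and n_gpu_layers mirrors the "offloading 32 layers to GPU" line in the log):

from llama_cpp import Llama

# Assumes llama-cpp-python >= 0.1.59 built with -DLLAMA_CUBLAS=on (see the pip command above).
llm = Llama(
    model_path="/data/llama/7B/ggml-model-f16.bin",
    n_ctx=2048,        # placeholder; the log above used a larger context
    n_gpu_layers=32,   # offload all 32 layers of the 7B model to the GPUs
)
out = llm("Q: What is multi-GPU offloading good for? A:", max_tokens=128)
print(out["choices"][0]["text"])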

wyhanz (Author) commented Jun 8, 2023

Can I use this with the high-level API, or is it available only in the low-level one?

Check the Llama class: the __init__() parameter n_parts ("Number of parts to split the model into. If -1, the number of parts is automatically determined.") can realize this feature. And I think the high-level API is just a wrapper around the low-level API to make it easier to use.

wyhanz (Author) commented Jun 8, 2023

Here is a brief experimental result.

I used Ziya, a large-scale pre-trained model based on LLaMA with 13 billion parameters, quantized to q8_0.

The model was successfully split across two GPUs. However, the same prompts seem slower than on a single 3090; more experiments are probably needed.

ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
  Device 1: NVIDIA GeForce RTX 3090
llama.cpp: loading model from /data0/zhangyuhan/llama_cpp/ggml-model-q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 39424
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  = 2457.17 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 layers to GPU
llama_model_load_internal: total VRAM used: 13370 MB
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 30%   44C    P2   117W / 370W |   7189MiB / 24576MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:68:00.0 Off |                  N/A |
| 52%   66C    P2   157W / 370W |  23693MiB / 24576MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2653794      C   python                           7186MiB |
|    1   N/A  N/A   2260259      C   python                          16496MiB |
|    1   N/A  N/A   2653794      C   python                           7186MiB |
+-----------------------------------------------------------------------------+

The inference time was slower than before, probably because I was running the evaluation code on another card at the same time 😄

gjmulder (Contributor) commented Jun 8, 2023

The model was successfully split across two GPUs. However, the same prompts seem slower than on a single 3090; more experiments are probably needed.

Yeah, I also got worse performance with two GPUs than with one. However, my second GPU is a 1080 Ti, so I assumed it was my environment.
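
One way to rule the slower card in or out (not something tried in this thread, just a sketch) is to hide it before loading the model, so everything runs on the 3090 Ti alone:

import os

# Make only device 0 (the 3090 Ti) visible to CUDA; must be set before
# llama_cpp initializes the CUDA backend.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama

llm = Llama(model_path="/data/llama/7B/ggml-model-f16.bin", n_gpu_layers=32)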

@AlphaAtlas

@wyzhangyuhan I think that is correct behavior? If it's like the CPU/GPU split, the GPUs will run sequentially, not in parallel: it will run some layers on one 3090, and then the rest on the other.

But I think you can fit a Q4 65B model on 2x 3090s with very minimal quality loss 👍
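
As a rough back-of-envelope check (an estimate, not a measurement from this thread): Q4_0 stores 4-bit weights plus one fp16 scale per 32-weight block, roughly 4.5 bits per weight, so the weights alone should fit in 48 GiB:

# Rough estimate only; ignores the KV cache, scratch buffers and context size.
params = 65e9              # 65B parameters
bits_per_weight = 4.5      # Q4_0: 32 x 4-bit weights + one fp16 scale per block
weights_gib = params * bits_per_weight / 8 / 2**30
print(f"~{weights_gib:.0f} GiB of weights vs. 48 GiB across two 3090s")  # ~34 GiB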

@wyhanz

This comment was marked as off-topic.

wyhanz closed this as completed Jun 22, 2023