How to use CUDA or BLAS #1070
-
There are no pre-built binaries with cuBLAS at the moment; you have to build it yourself.
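For anyone landing here: at the time of this thread, the build flag added by #1044 was LLAMA_CUBLAS (newer trees renamed it to GGML_CUDA). A minimal sketch, assuming the CUDA Toolkit is already installed:

```sh
# Makefile build (flag from #1044; newer versions use GGML_CUDA=1)
make clean && make LLAMA_CUBLAS=1

# CMake equivalent
cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```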
-
I did a git pull with this version: master-02d6988.
It turns out that in nvcc --help, the allowed values for --gpu-architecture don't include 'native'.
This is my nvcc --version:
First I tried testing it with just ./main -m ./vicuna-7B-1.1-GPTQ-4bit-128g-GGML.bin; it works, but the output is gibberish.
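(Side note: only newer CUDA releases accept native as a --gpu-architecture value. On an older toolkit you can ask nvcc what it actually supports and pin one of those values instead:)

```sh
# list the compute architectures your nvcc actually supports,
# then use one of these in place of 'native'
nvcc --list-gpu-arch
```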
-
I did a git pull with this version: master (02d6988).
Then I updated the Makefile to fix this issue (see the sketch below).
I'm not sure if this is a suitable way.
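A sketch of that kind of fix, assuming the failure came from the -arch=native flag in the Makefile's NVCCFLAGS (the sm_86 value is just an example for an RTX 30-series card; pick your own GPU's compute capability):

```sh
# replace 'native' (unsupported by older nvcc) with an explicit arch
sed -i 's/-arch=native/-arch=sm_86/' Makefile
make clean && make LLAMA_CUBLAS=1
```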
-
Maybe add a CUDA version binary to a release?
-
Trying to compile with BLAS support was very painful for me on Windows. I spent a few hours trying to make it work. I tried the Intel MKL / oneAPI version and OpenBLAS, but I could never get CMake to recognize my BLAS libraries no matter what I did. I eventually found this repository, which provides a pre-compiled llama.cpp with BLAS already enabled:
-
Okay, I spent several hours trying to make it work, so here are a few ideas. Make sure your VS tools are the ones CUDA was integrated with during install. The cleanest solution is to remove both VS and CUDA entirely, then delete any CMakeCache.txt you might still have lying around, then install VS first and CUDA after it; at that point it should basically start working. It didn't for me, though: I have BLAS = 1 but no performance increase at all. My GPU isn't running any calculations either; I see a very short spike of activity, but it doesn't affect eval time at all. No idea why.
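For what it's worth, that symptom matches how the initial cuBLAS integration behaved: it only offloaded the large matrix multiplications of prompt processing, so it improves the "prompt eval time" line rather than the per-token "eval time", and GPU usage shows up as short bursts. One way to see the effect is a prompt-heavy run (the model path and prompt file below are placeholders):

```sh
# compare "prompt eval time" with and without cuBLAS;
# per-token "eval time" is expected to stay roughly the same
./main -m ./models/7B/ggml-model-q4_0.bin -f long-prompt.txt -n 64
```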
-
For those who struggle with the Windows build:
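A typical recipe from that era looked roughly like this (a sketch, assuming Visual Studio with the C++ workload plus the CUDA Toolkit are installed, and the pre-rename LLAMA_CUBLAS flag):

```sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```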
-
Do the binary releases now contain the cuBLAS code? It looks like it was all merged into releases starting last night:
If so, is there a command-line switch or an environment variable to get the binary to notice cuBLAS?
-
I recommend you follow this article: https://medium.com/@piyushbatra1999/installing-llama-cpp-python-with-nvidia-gpu-acceleration-on-windows-a-short-guide-0dfac475002d but change the terminal commands for reinstalling llama-cpp-python to the following, or else it doesn't work (at least it didn't for me):
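The commonly cited variant of those commands was of this shape (a sketch: CMAKE_ARGS and FORCE_CMAKE are the environment variables llama-cpp-python's build reads, and LLAMA_CUBLAS was the flag before the GGML_CUDA rename; on Windows cmd, use `set VAR=value` on separate lines instead of the inline form):

```sh
# force a from-source rebuild of llama-cpp-python with cuBLAS enabled
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```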
-
Ugh. So CUDA is something else than cuBLAS... why not -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=CUBLAS?

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS

fails at 4:00 AM with:

Use GGML_CUDA instead
Call Stack (most recent call first):
CMakeLists.txt:105
-- Configuring incomplete, errors occurred!
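For the record, the error itself says what to do: CUDA/cuBLAS support is enabled with GGML_CUDA, while GGML_BLAS and GGML_BLAS_VENDOR are only for CPU BLAS backends such as OpenBLAS. Dropping the BLAS flags lets it configure (a sketch, reusing the flags from the failing command):

```sh
# CUDA backend only; no GGML_BLAS flags are needed for the GPU path
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release
```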
-
I don't know, is the author a teenager who doesn't remember how people used to do make and cmake? I'm from 1987, which makes me 37 years old, and why...
-
With master-8944a13 - Add NVIDIA cuBLAS support (#1044) I was looking forward to seeing any differences.
Sadly, I don't.
I can't even see that my RTX 3060 is being used in any way at all by llama.cpp's main.exe on Windows, using the win-avx2 version.
Is there anything that needs to be switched on to use CUDA?
The system_info line of main.exe shows this:
So why is BLAS = 0?
Is there anything needed to use BLAS?
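A likely explanation, for what it's worth: the win-avx2 release zip is a CPU-only build, so its system_info line reports BLAS = 0 regardless of what hardware is present. A binary built with cuBLAS (see the build sketches earlier in this thread, via #1044) reports BLAS = 1 in the same line. A quick sanity check with a placeholder model path:

```sh
# a cuBLAS-enabled build prints "BLAS = 1" in its system_info line
./main -m ./model.bin -p "hello" -n 8
```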