Applying lora with CUDA crashes with failed assertion #1846
Comments
Same problem here. I'm on Ubuntu 20.04 with NVIDIA driver version 530.30.02, CUDA version 12.1, and an M40 GPU. As a possible solution I've tried completely wiping the NVIDIA drivers and CUDA from my system, reinstalling them, and compiling llama.cpp again... no change. It still crashes if I use both a LoRA and the GPU at the same time, and still works if I use the GPU with no LoRA, or a LoRA with no GPU.
Just curious, does it still crash without --lora-base?
For me at least, yep, I still get the crash if I don't use --lora-base.
Weird. I was playing with LoRA earlier today and didn't have that issue (but I was only using cuBLAS for the prompt, not offloading layers). A big pull request that changes the CUDA code was merged a couple of minutes ago. You could try pulling and recompiling to see if that happens to fix your issue.
I'm on it.
Just did a fresh pull, make clean, and LLAMA_CUBLAS=1 make. No change with the crash, I'm afraid. But thanks to everyone trying to figure it out!
I looked into the issue and, quite frankly, I don't think it's worth the effort to fix. The CUDA code currently runs everything as f32 by default, and it would take quite a few changes to get good performance out of GPU-accelerated LoRAs. But if someone wants good performance they'll merge the LoRA anyway. Maybe once I implement better f16 support for something else I'll revisit this.
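In case it helps anyone who needs GPU offloading together with a fine-tune right now, the practical workaround is the one mentioned above: merge the adapter into the base weights ahead of time so the run doesn't need --lora at all. A minimal sketch of one way to do that, assuming the adapter was trained with Hugging Face PEFT against an HF-format LLaMA checkpoint (all paths below are placeholders, not the files from this issue):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the f16 base model and attach the LoRA adapter (placeholder paths).
base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the LoRA deltas into the base weights and save a plain checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/llama-13b-merged")
AutoTokenizer.from_pretrained("path/to/llama-13b-hf").save_pretrained("path/to/llama-13b-merged")

The merged checkpoint can then be converted and quantized with the usual llama.cpp tooling and run with --n-gpu-layers but without --lora.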
Thank you so much for digging into it! It's a relief just knowing what's going on there. This is the first time I'd seen anyone else mention it, and I was really starting to think that I was messing something up somewhere.
Thanks for looking into this and finding the issue.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I am running the latest code. Development is very rapid so there are no tagged versions as of now.
Running CPU-only with a LoRA works fine, and GPU offload without a LoRA works fine; combining --lora with --n-gpu-layers crashes with the assertion shown in the third run below.
$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin
main: build = 669 (9254920)
main: seed = 1686722870
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 15237.95 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
.................... done (64362.93 ms)
system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0
This is a test prompted the voice. It said other things, but I couldn't understand them or remember them later if they were important.
llama_print_timings: load time = 70609.41 ms
llama_print_timings: sample time = 23.21 ms / 25 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 688.94 ms / 6 tokens ( 114.82 ms per token)
llama_print_timings: eval time = 6819.37 ms / 24 runs ( 284.14 ms per token)
llama_print_timings: total time = 7542.23 ms
$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --n-gpu-layers 30
main: build = 669 (9254920)
main: seed = 1686723899
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
[snip]
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5594.59 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 30 layers to GPU
llama_model_load_internal: total VRAM used: 10156 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0
This is a test prompt for the title.
You are reading "Testing the Title" [end of text]
llama_print_timings: load time = 4321.42 ms
llama_print_timings: sample time = 7.74 ms / 15 runs ( 0.52 ms per token)
llama_print_timings: prompt eval time = 403.46 ms / 6 tokens ( 67.24 ms per token)
llama_print_timings: eval time = 1738.15 ms / 14 runs ( 124.15 ms per token)
llama_print_timings: total time = 2153.10 ms
$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin --n-gpu-layers 1
[snip]
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 14916.51 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 1 layers to GPU
llama_model_load_internal: total VRAM used: 834 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
...................GGML_ASSERT: ggml.c:14307: tensor->src1 == NULL || tensor->src1->backend == GGML_BACKEND_CPU
Aborted (core dumped)