Applying lora with CUDA crashes with failed assertion #1846

Closed
1 task done
d-takemori opened this issue Jun 14, 2023 · 10 comments
Labels: stale, wontfix (this will not be worked on)

Comments

d-takemori commented Jun 14, 2023

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.

  • Running CPU-only with the LoRA works fine:

$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin
main: build = 669 (9254920)
main: seed = 1686722870
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 15237.95 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
.................... done (64362.93 ms)

system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0

This is a test prompted the voice. It said other things, but I couldn't understand them or remember them later if they were important.
llama_print_timings: load time = 70609.41 ms
llama_print_timings: sample time = 23.21 ms / 25 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 688.94 ms / 6 tokens ( 114.82 ms per token)
llama_print_timings: eval time = 6819.37 ms / 24 runs ( 284.14 ms per token)
llama_print_timings: total time = 7542.23 ms

  • Running the same command with GPU offload and NO LoRA works:

./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --n-gpu-layers 30
main: build = 669 (9254920)
main: seed = 1686723899
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
[snip]
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5594.59 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 30 layers to GPU
llama_model_load_internal: total VRAM used: 10156 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB

system_info: n_threads = 18 / 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 25, n_keep = 0

This is a test prompt for the title.
You are reading "Testing the Title" [end of text]

llama_print_timings: load time = 4321.42 ms
llama_print_timings: sample time = 7.74 ms / 15 runs ( 0.52 ms per token)
llama_print_timings: prompt eval time = 403.46 ms / 6 tokens ( 67.24 ms per token)
llama_print_timings: eval time = 1738.15 ms / 14 runs ( 124.15 ms per token)
llama_print_timings: total time = 2153.10 ms

  • Running with the LoRA AND ANY number of layers offloaded to the GPU crashes with a failed assertion:

$ ./main --n-predict 25 --model /data/LLaMA/13B/ggml-model-q8_0.bin --prompt "This is a test prompt" --lora /data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin --lora-base /data/LLaMA/13B/ggml-model-f16.bin --n-gpu-layers 1
[snip]
llama_model_load_internal: ggml ctx size = 13189.95 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 14916.51 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 1 layers to GPU
llama_model_load_internal: total VRAM used: 834 MB
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
llama_apply_lora_from_file_internal: applying lora adapter from '/data/LLaMA/loras/llama-13b_test-lora/ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 96, alpha = 192, scaling = 2.00
llama_apply_lora_from_file_internal: loading base model from '/data/LLaMA/13B/ggml-model-f16.bin'
llama.cpp: loading model from /data/LLaMA/13B/ggml-model-q8_0.bin
...................GGML_ASSERT: ggml.c:14307: tensor->src1 == NULL || tensor->src1->backend == GGML_BACKEND_CPU
Aborted (core dumped)
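
To make the failure mode easier to follow, here is a minimal, hypothetical sketch (in C, not the actual ggml.c code) of the invariant that assertion appears to enforce: an operation evaluated on the CPU, such as the graph that adds the LoRA delta onto a weight, expects its source tensors to live in host memory, so a weight that --n-gpu-layers has already moved to the GPU backend trips the check.

```c
// Hypothetical illustration only; not the real ggml implementation.
// It mirrors the check reported at ggml.c:14307 in the log above: the second
// source of a CPU-executed op, if present, must still be on the CPU backend.
#include <assert.h>
#include <stddef.h>

enum backend { BACKEND_CPU, BACKEND_GPU };

struct tensor {
    enum backend   backend;  // where this tensor's data lives
    struct tensor *src0;     // first input of the op that produced it
    struct tensor *src1;     // second input (may be NULL)
};

// Applying a LoRA builds a small CPU-side graph that adds the low-rank delta
// onto each adapted weight. If --n-gpu-layers has already placed that weight
// on the GPU backend, the check below fails and the process aborts.
static void compute_forward_cpu(const struct tensor *t) {
    assert(t->src1 == NULL || t->src1->backend == BACKEND_CPU);
    // ... CPU computation on t->src0 / t->src1 would follow here ...
}
```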

EmerJK commented Jun 14, 2023

Same problem here. I'm on Ubuntu 20.04 with NVIDIA driver version 530.30.02, CUDA version 12.1, and an M40 GPU. As a possible fix I tried completely wiping the NVIDIA drivers and CUDA from my system, reinstalling them, and compiling llama.cpp again... no change. It still crashes if I use both a LoRA and the GPU at the same time, and it still works if I use the GPU with no LoRA, or a LoRA with no GPU.

@KerfuffleV2 (Collaborator)

Just curious, does it still crash without --lora-base?

EmerJK commented Jun 14, 2023

> Just curious, does it still crash without --lora-base?

For me at least, yep, I still get the crash if I don't use --lora-base.

@KerfuffleV2 (Collaborator)

Weird. I was playing with LoRA earlier today and didn't have that issue (but I was only using cuBLAS for the prompt, not offloading layers). A big pull that changes the CUDA code just got merged a couple of minutes ago. You could try pulling and recompiling to see whether that happens to fix your issue.

@JohannesGaessler (Collaborator)

I'm on it.

JohannesGaessler self-assigned this Jun 14, 2023
EmerJK commented Jun 14, 2023

Just did a fresh pull, make clean, and LLAMA_CUBLAS=1 make. No change with the crash, I'm afraid. But thanks to everyone trying to figure it out!

@JohannesGaessler (Collaborator)

I looked into the issue and, quite frankly, I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default, and it would require quite a few changes to get good performance out of GPU-accelerated LoRAs. But if someone wants good performance they'll merge the LoRA anyway. Maybe once I implement better f16 support for something else I'll revisit this.
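
For readers unfamiliar with what "merge the LoRA" means here, the sketch below (plain C, independent of llama.cpp's actual code) shows the arithmetic involved: each adapted weight matrix W is replaced by W + scaling * B A, where B and A are the adapter's low-rank factors and scaling = alpha / r (with r = 96 and alpha = 192 as in the logs above, scaling = 2.0, matching what llama_apply_lora_from_file_internal prints). Once the delta is folded into the weights offline, the GPU only ever sees ordinary weight matrices and no LoRA-specific code path is needed at run time.

```c
// Hypothetical sketch of "merging" a LoRA into a base weight matrix.
// Not llama.cpp code; plain row-major float buffers for clarity.
#include <stddef.h>

// W: n_out x n_in (base weight, updated in place)
// B: n_out x r,   A: r x n_in (LoRA factors), scaling = alpha / r
static void lora_merge(float *W, const float *B, const float *A,
                       size_t n_out, size_t n_in, size_t r, float alpha) {
    const float scaling = alpha / (float) r;       // 192 / 96 = 2.0 in this issue
    for (size_t i = 0; i < n_out; ++i) {
        for (size_t j = 0; j < n_in; ++j) {
            float delta = 0.0f;
            for (size_t k = 0; k < r; ++k) {
                delta += B[i*r + k] * A[k*n_in + j];   // (B A)[i][j]
            }
            W[i*n_in + j] += scaling * delta;          // W' = W + scaling * B A
        }
    }
}
```

(The f16 model passed via --lora-base in the commands above exists for a related reason: the delta is applied against full-precision base weights rather than directly against the quantized ones, which would lose precision.)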

JohannesGaessler removed their assignment Jun 14, 2023
JohannesGaessler added the wontfix label Jun 14, 2023
EmerJK commented Jun 14, 2023

> I looked into the issue and, quite frankly, I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default, and it would require quite a few changes to get good performance out of GPU-accelerated LoRAs.

Thank you so much for digging into it! It's a relief just knowing what's going on. This is the first time I'd seen anyone else mention it, and I was really starting to think I was messing something up somewhere.

@d-takemori (Author)

Thanks for looking into this and finding the issue.

This issue was closed because it has been inactive for 14 days since being marked as stale.
