Description
Disclaimer:
I have researched this extensively, and this may not be an error in current bitsandbytes. There are so many interlocking packages that the error may be the result of my particular combination.
That said, I have not been able to properly split the total available GPU memory using a custom device_map with load_in_8bit=True when loading a large language model such as "EleutherAI/gpt-neox-20b".
I can only max out the memory of gpu0 at 22+ GB, with almost 3 GB loaded on gpu1. That configuration is stable on my machine, with no OOM.
This custom device_map works with load_in_8bit=True:

```python
chip_map = {"gpt_neox.embed_in": 1,
            "gpt_neox.layers": 0,
            "gpt_neox.final_layer_norm": 1,
            "embed_out": 1}
```
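For context, this is roughly how the attached script loads the model with that map; the model path is a hypothetical placeholder (I use local models only), and the call pattern follows transformers 4.25 / accelerate 0.15:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/gpt-neox-20b"  # hypothetical placeholder for my local copy

chip_map = {"gpt_neox.embed_in": 1, "gpt_neox.layers": 0,
            "gpt_neox.final_layer_norm": 1, "embed_out": 1}

tokenizer = AutoTokenizer.from_pretrained(model_path)

# This loads and generates without error: all transformer layers on gpu0,
# embeddings and final layer norm on gpu1.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=chip_map,  # custom placement instead of device_map="auto"
    load_in_8bit=True,    # bitsandbytes int8 weights
)
```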
I can switch to "my_map", which is "chip_map" expanded into an explicit layer-by-layer assignment with the same GPU 0/1 placement. Unfortunately, changing a single gpt_neox.layers entry from 0 to 1 triggers the error:

Exception: cublasLt ran into an error!
```python
my_map = {"gpt_neox.embed_in": 1,
          "gpt_neox.layers.0": 1,  # cuBLAS API failed with status 15
          "gpt_neox.layers.1": 1,
          # etc.
          "gpt_neox.layers.42": 0,
          "gpt_neox.layers.43": 0,
          "gpt_neox.final_layer_norm": 1,
          "embed_out": 1}
```
I can split the layers across the two GPUs using fp16, but I have had no success doing the same with 8-bit; the working fp16 variant is sketched below.
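A minimal sketch of the fp16 split that does work, reusing model_path and my_map from above (same assumptions as the earlier sketch):

```python
import torch
from transformers import AutoModelForCausalLM

# Same per-layer my_map as above: splitting across both GPUs works in fp16.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_path,                 # hypothetical local path, as above
    device_map=my_map,
    torch_dtype=torch.float16,  # fp16 instead of load_in_8bit=True
)
```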
Monitoring in real time with 'watch -n0.1 nvidia-smi' shows the memory being loaded onto each GPU before the 'cuBLAS API failed with status 15' error appears.
I use local models only (no Hub).
I've attached a simplified version of the '20b_8bit.py' script, which shows an example of each device_map and simple generation from a fixed prompt.
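The generation step that hits the failure is essentially the following; the generate() arguments are taken verbatim from the traceback below, while the prompt and the input device placement are placeholders:

```python
prompt = "Hello, my name is"  # placeholder; the attached script uses a fixed prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(0)

# With my_map + load_in_8bit=True this raises "cublasLt ran into an error!"
# in a layer's query_key_value 8-bit matmul (see traceback below); with
# chip_map it completes normally.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=128,
)
print(tokenizer.decode(gen_tokens[0]))
```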
Current platform and packages:
2x 24 GB RTX cards, Ubuntu 22.04, CUDA 12.0, Python 3.10.6, transformers 4.25.0.dev0, torch 1.13.1, accelerate 0.15.0, bitsandbytes 0.37.0 (built for CUDA 12.0)
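For reproduction, the version line at the top of the log below presumably comes from a simple check along these lines (a sketch; importing bitsandbytes is what emits the CUDA SETUP lines):

```python
import torch
import transformers
import bitsandbytes  # importing this prints the "CUDA SETUP" lines below

print(transformers.__version__)   # 4.25.0.dev0 in my run
print(torch.__version__)          # 1.13.1
print(torch.cuda.device_count())  # 2
```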
If this is unique to my system, then perhaps someone can advise which sources to modify, and how, so that I can rebuild.
Error report follows. Line breaks added for clarity.
```
python3 20b_8bit.py
4.25.0.dev0
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 120
CUDA SETUP: Loading binary /home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda120.so...
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:0 for open-end generation.
cuBLAS API failed with status 15
A: torch.Size([113, 6144]), B: torch.Size([18432, 6144]), C: (113, 18432); (lda, ldb, ldc): (c_int(3616), c_int(589824), c_int(3616)); (m, n, k): (c_int(113), c_int(18432), c_int(6144))
error detected
Traceback (most recent call last):
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/20b_8bit.py", line 87, in <module>
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=128)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/generation/utils.py", line 1576, in generate
return self.sample(
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/generation/utils.py", line 2536, in sample
outputs = self(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 654, in forward
outputs = self.gpt_neox(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 546, in forward
outputs = layer(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 319, in forward
attention_layer_outputs = self.attention(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 115, in forward
qkv = self.query_key_value(hidden_states)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
```

Attachment: 20b_8bit.py.zip