Description
Disclaimer:
I have researched this extensively, and this may not be an error in current bitsandbytes. There are so many interlocking packages that the error may be the result of my particular combination.
That said, I have not been able to properly split the total available GPU memory using a custom device_map with load_in_8bit=True when loading a large language model such as "EleutherAI/gpt-neox-20b".
I can only max out the memory of gpu0 at 22+ GB, with almost 3 GB loaded on gpu1. That configuration is stable on my machine, with no OOM.
This custom device_map works with load_in_8bit=True:

```python
chip_map = {"gpt_neox.embed_in": 1,
            "gpt_neox.layers": 0,
            "gpt_neox.final_layer_norm": 1,
            "embed_out": 1}
```
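For context, this is roughly how the attached script loads the model with that map; the model path is a hypothetical placeholder (I use local models only), and the call pattern follows transformers 4.25 / accelerate 0.15:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/gpt-neox-20b"  # hypothetical placeholder for my local copy

chip_map = {"gpt_neox.embed_in": 1, "gpt_neox.layers": 0,
            "gpt_neox.final_layer_norm": 1, "embed_out": 1}

tokenizer = AutoTokenizer.from_pretrained(model_path)

# This loads and generates without error: all transformer layers on gpu0,
# embeddings and final layer norm on gpu1.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=chip_map,  # custom placement instead of device_map="auto"
    load_in_8bit=True,    # bitsandbytes int8 weights
)
```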
I can switch to "my_map", which is "chip_map" expanded into an explicit layer-by-layer assignment with the same GPU 0/1 placement. Unfortunately, changing a single gpt_neox.layers entry from 0 to 1 triggers the error:

Exception: cublasLt ran into an error!
```python
my_map = {"gpt_neox.embed_in": 1,
          "gpt_neox.layers.0": 1,  # cuBLAS API failed with status 15
          "gpt_neox.layers.1": 1,
          # etc.
          "gpt_neox.layers.42": 0,
          "gpt_neox.layers.43": 0,
          "gpt_neox.final_layer_norm": 1,
          "embed_out": 1}
```
I can split the layers across the two GPUs using fp16, but I have had no success doing the same with 8-bit; the working fp16 variant is sketched below.
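A minimal sketch of the fp16 split that does work, reusing model_path and my_map from above (same assumptions as the earlier sketch):

```python
import torch
from transformers import AutoModelForCausalLM

# Same per-layer my_map as above: splitting across both GPUs works in fp16.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_path,                 # hypothetical local path, as above
    device_map=my_map,
    torch_dtype=torch.float16,  # fp16 instead of load_in_8bit=True
)
```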
Monitoring in real time with 'watch -n0.1 nvidia-smi' shows the memory being loaded onto each GPU before the 'cuBLAS API failed with status 15' error appears.
I use local models only (no Hub).
I've attached a simplified version of the '20b_8bit.py' script, which shows an example of each device_map and simple generation from a fixed prompt.
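The generation step that hits the failure is essentially the following; the generate() arguments are taken verbatim from the traceback below, while the prompt and the input device placement are placeholders:

```python
prompt = "Hello, my name is"  # placeholder; the attached script uses a fixed prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(0)

# With my_map + load_in_8bit=True this raises "cublasLt ran into an error!"
# in a layer's query_key_value 8-bit matmul (see traceback below); with
# chip_map it completes normally.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=128,
)
print(tokenizer.decode(gen_tokens[0]))
```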
Current platform and packages:
2x 24 GB RTX cards, Ubuntu 22.04, CUDA 12.0, Python 3.10.6, transformers 4.25.0.dev0, torch 1.13.1, accelerate 0.15.0, bitsandbytes 0.37.0 (built for CUDA 12.0)
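For reproduction, the version line at the top of the log below presumably comes from a simple check along these lines (a sketch; importing bitsandbytes is what emits the CUDA SETUP lines):

```python
import torch
import transformers
import bitsandbytes  # importing this prints the "CUDA SETUP" lines below

print(transformers.__version__)   # 4.25.0.dev0 in my run
print(torch.__version__)          # 1.13.1
print(torch.cuda.device_count())  # 2
```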
If this is unique to my system, then perhaps someone can advise which sources to modify, and how, so that I can rebuild.
Error report follows. Line breaks added for clarity.
```
python3 20b_8bit.py
4.25.0.dev0
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 120
CUDA SETUP: Loading binary /home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda120.so...
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:0 for open-end generation.
cuBLAS API failed with status 15
A: torch.Size([113, 6144]), B: torch.Size([18432, 6144]), C: (113, 18432); (lda, ldb, ldc): (c_int(3616), c_int(589824), c_int(3616)); (m, n, k): (c_int(113), c_int(18432), c_int(6144))
error detected
Traceback (most recent call last):
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/20b_8bit.py", line 87, in <module>
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=128)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/generation/utils.py", line 1576, in generate
return self.sample(
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/generation/utils.py", line 2536, in sample
outputs = self(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 654, in forward
outputs = self.gpt_neox(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 546, in forward
outputs = layer(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 319, in forward
attention_layer_outputs = self.attention(
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/media/ubuntubox/5946c983-8717-448c-9007-488d4e825643/8bit_sampling/transformers/models/gpt_neox/modeling_gpt_neox.py", line 115, in forward
qkv = self.query_key_value(hidden_states)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/ubuntubox/.local/lib/python3.10/site-packages/bitsandbytes-0.37.0-py3.10.egg/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
```

Attachment: 20b_8bit.py.zip