OOM when loading 300B models with AutoModelForCausalLM.from_pretrained and BitsAndBytesConfig quantization #31577
Comments
Hi @Neo9061, thanks for reporting this issue and providing the detailed context. I'm a maintainer of bitsandbytes and have a few thoughts:

Regarding issue 1) From the code you provided, you don't seem to be using any parallel training approach, like FSDP. Is that right? In that case it would be expected that you cannot use the full memory of all combined GPUs, and therefore it would be expected that you run OOM. You mentioned that it "should run". To qualify this with a concrete estimate, I did the following back-of-the-envelope math:
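A rough sketch of this kind of estimate, with all inputs (parameter count, trainable fraction, gradient and optimizer precision) as assumed values rather than the figures used in the original calculation; with these particular assumptions the total happens to land near the 186 GB discussed next.

```python
# Back-of-the-envelope GPU memory estimate for QLoRA fine-tuning.
# Every number here is an assumption for illustration only.

def estimate_qlora_memory_gb(
    n_params=300e9,           # assumed total parameter count (~300B)
    quant_bits=4,             # 4-bit (NF4) weight storage
    trainable_fraction=0.01,  # assumed share of params trained via LoRA
    grad_bytes=4,             # fp32 gradients for trainable params
    optim_bytes=8,            # Adam: two fp32 states per trainable param
):
    weights = n_params * quant_bits / 8        # quantized base weights
    trainable = n_params * trainable_fraction  # LoRA parameters
    grads = trainable * grad_bytes
    optim = trainable * optim_bytes
    return {
        "weights_gb": weights / 1e9,
        "gradients_gb": grads / 1e9,
        "optimizer_gb": optim / 1e9,
        "total_gb_excl_activations": (weights + grads + optim) / 1e9,
    }

print(estimate_qlora_memory_gb())
# -> weights ~150 GB, gradients ~12 GB, optimizer ~24 GB, total ~186 GB
```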
However, 186 GB doesn't yet account for the activations. I'm not certain what values to plug in for the activation calculation, but with 186 GB used out of the total 192 GB combined capacity of your 8-GPU AWS instance (g5.48xlarge), there is very little headroom left, so running OOM is not surprising. Optimizer states and gradients can also potentially have lower precisions, so it would be important to take that into account (i.e. confirm our calculation assumption or use less precision to save memory in one of your tests).

On a side note, not necessarily related but I want to mention it anyway: we have observed an issue with QLoRA and long sequence lengths. I investigated this briefly after we became aware that there might be a memory leak leading to excessive memory consumption for high sequence lengths. We have a new engineer, @matthewdouglas, joining the BNB team in July. Once he's on board, we plan to reassess the importance and urgency of this issue. It would be helpful if you could look into this a bit and, if you think it's a blocker, ping us again.

Regarding issue 2) I think this is more for @philschmid and the others to answer, as nothing immediately catches my eye.
Hi @Titus-von-Koeller, thanks for answering my questions! To follow up, my first issue is OOM during the model loading stage, not the model fine-tuning stage. I followed the blogpost https://www.philschmid.de/sagemaker-train-deploy-llama3, which initializes FSDP, and here is the training script.

Q1. By this line (transformers/src/transformers/modeling_utils.py, lines 4217 to 4232 in dd4654e), loading uses low_cpu_mem_usage=True (is that true? and it seems to use a single GPU - rank 0 - to load the entire quantized model). On the other hand, by using @philschmid's blogpost and code, I am able to load and train two models: Neither

Q2. For the memory you computed, is it for model loading or model fine-tuning? My understanding is that the memory for activations, optimizers, and gradients is not required at the model loading stage.

Q3. For the formula of activations:
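One commonly cited approximation for this (from the Megatron-LM activation recomputation paper, Korthikanti et al. 2022) is s·b·h·(34 + 5·a·s/h) bytes per transformer layer for fp16/bf16 activations without recomputation. The sketch below plugs in assumed, not official, model dimensions.

```python
# Approximate activation memory per transformer layer, in bytes:
#   s * b * h * (34 + 5 * a * s / h)
# s = sequence length, b = micro-batch size, h = hidden size,
# a = number of attention heads (fp16/bf16, no activation checkpointing).
# All model dimensions below are placeholders, not Grok-1's real config.

def activation_memory_gb(seq_len, batch_size, hidden_size, num_heads, num_layers):
    per_layer = seq_len * batch_size * hidden_size * (
        34 + 5 * num_heads * seq_len / hidden_size
    )
    return per_layer * num_layers / 1e9

print(activation_memory_gb(seq_len=2048, batch_size=1,
                           hidden_size=6144, num_heads=48, num_layers=64))
# roughly 90+ GB without gradient checkpointing, which is why activation
# memory matters so much at this scale
```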
cc @matthewdouglas, who has taken over the lead on this task
Up! Are there any suggestions on this issue?
@Neo9061 @thepowerfuldeez See the PR #32276. The observation here is that weights would be offloaded to CPU memory for all ranks instead of just one (e.g. 8x CPU memory requirement on the g5.48xlarge and p4d.24xlarge mentioned in the original issue). This usage goes back down after the model is loaded, so a temporary workaround could be to create additional swap space on local NVMe storage. In addition to this, I'm testing out some further changes to enable the usage of prequantized checkpoints with FSDP+QLoRA.
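For context on the FSDP+QLoRA direction mentioned above, here is a sketch of the kind of 4-bit config the Hugging Face docs describe for FSDP+QLoRA. It illustrates the general setup rather than the specific change in the PR, and the exact arguments needed for prequantized checkpoints may differ.

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of a 4-bit config commonly used with FSDP+QLoRA. The key detail
# is bnb_4bit_quant_storage: storing the packed 4-bit weights in the same
# dtype as the rest of the model lets FSDP shard them.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
```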
Since the PR was merged but then reverted, @matthewdouglas, is there another PR we can follow for this feature?
New PR: #33154
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
New PR was merged, closing.
System Info
My goal is to follow the distributed fine-tuning blogpost with FSDP to test distributed fine-tuning on a larger model like the 300B Grok-1.
For context, I have tried g5.48xlarge (8 GPUs with 192 GB total GPU memory and 768 GB CPU memory) and p4d.24xlarge (8 GPUs with 320 GB total GPU memory and 1152 GB CPU memory). There are two issues, listed below.
The transformers version is:
transformers==4.40.0
Issue 1
When I tried to load the model with 4-bit quantization using the code below (WITHOUT FSDP, purely on an EC2 g5.48xlarge), the total GPU memory required should be around 150 GB (since the model is ~300B Grok-1), which is smaller than the 192 GB of combined GPU memory on the g5.48xlarge, but I hit OOM. If I turn on low_cpu_mem_usage=True, then the model can be successfully loaded onto CPU on the g5.48xlarge. The same error happens on p4d.24xlarge, where 4-bit quantization also fails at loading.
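A minimal sketch of this kind of 4-bit load; the checkpoint name, device_map, and exact arguments are assumptions rather than the original script.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load a ~300B checkpoint in 4 bit, sharded across the 8 GPUs of a
# g5.48xlarge; this is the step where the reported OOM occurs.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/grok-1-hf",              # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```

Issue 2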
Continuing from point 1, I think I found a path forward to load the model onto CPU by setting low_cpu_mem_usage=True. Following the blogpost above, I started a SageMaker training job and tried to load this model using the default qlora_fsdp script shown in the blog. Further, I disabled the quantization (as the quantization loads the model onto GPUs, and that failed as described in point 1). When FSDP is enabled, it will by default use low_cpu_mem_usage=True according to this line. However, I hit a timeout issue even after I modified the training argument ddp_timeout to be 10800. The model checkpoints are loaded twice, and loading fails the second time.
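As a rough illustration of the knobs mentioned in Issue 2 (not the actual script from the blog post), here is a sketch of the relevant TrainingArguments; the FSDP options shown are assumptions about a typical setup.

```python
from transformers import TrainingArguments

# Illustrative only: FSDP enabled via TrainingArguments, with the raised
# distributed timeout mentioned above (the default is 1800 seconds).
args = TrainingArguments(
    output_dir="./out",                 # placeholder path
    per_device_train_batch_size=1,
    bf16=True,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        # rank 0 loads the checkpoint, other ranks get empty (meta) weights;
        # this corresponds to the low_cpu_mem_usage=True behavior above
        "cpu_ram_efficient_loading": True,
        "sync_module_states": True,
    },
    ddp_timeout=10800,
)
```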
Who can help?
@philschmid @SunMarc @lewtun @sgugger @ArthurZucker @pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Same as above
Expected behavior
Should be no OOM