
Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb loss is all NaN #454

Open
z3ugma opened this issue Dec 4, 2023 · 25 comments

z3ugma commented Dec 4, 2023

This notebook: peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb

Trains fine on Google Colab at https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=upI97XEH6EKe

using Python 3.10.12, Torch 2.1.0

It does not train on my workstation - the loss collapses to NaN after just a few epochs:

Loss: 6.078125
['a soccer player with his arms up in the air\n', 'mario balotelli celebrates after scoring against juventus\n']
Loss: 3.630859375
Loss: 4.01171875
Epoch: 2
Loss: 4.48046875
['cristiano ronaldo is the most expensive player in the world\n', 'a soccer player with his arms raised in celebration\n']
Loss: 3.25
Loss: 4.2734375
Epoch: 3
Loss: 4.0625
['a bald soccer player with a white shirt and blue shorts\n', "Juventus' Mario Mandzukic celebrates after scoring against Barcelona\n"]
Loss: 3.01953125
Loss: nan
Epoch: 4
Loss: nan

My workstation uses nearly the same versions: Python 3.10.13 and Torch 2.1.0. What could be causing the loss to be all NaN?
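To pin down exactly where a run diverges, one option is to record each step's loss and scan for the first non-finite value. A minimal sketch (the `first_nonfinite` helper is illustrative, not part of the notebook):

```python
import math

def first_nonfinite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None

# Loss values from the log above; the first NaN appears at index 8.
logged = [6.078125, 3.630859375, 4.01171875, 4.48046875,
          3.25, 4.2734375, 4.0625, 3.01953125, float("nan")]
print(first_nonfinite(logged))  # → 8
```

Knowing the exact step the loss first goes non-finite helps tell an fp16 overflow (loss blows up over a few steps) apart from a single bad batch.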


z3ugma commented Dec 4, 2023

@younesbelkada you're the author of that sample notebook and the keeper of the football dataset on Hugging Face - any idea what might be causing the loss to go to nan?

@triangle959

@z3ugma Same problem here. I start getting NaN loss in the 2nd batch of epoch 0. Have you solved it?

@triangle959

I found something interesting: on Google Colab, the loss does not become NaN. There must still be some difference between Colab and the local notebook.

@jeffliu-LL

I also ran into this issue recently when fine-tuning BLIP2, whereas it was working before. I haven't had a chance to pin it down, but it might be a package introducing a breaking change?

@jeffliu-LL

Rolling back to peft==0.5.0 got the BLIP2 example working for me.
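For anyone trying the same rollback, this is roughly the pin set implied by this comment together with the working Colab versions reported later in the thread. Treat it as a starting point, not a guaranteed fix:

```
torch==2.0.1
transformers==4.35.2
datasets==2.16.0
peft==0.5.0
```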

@AntoniaSch

@jeffliu-LL which pytorch version are you using?

@jeffliu-LL

pytorch 2.0.1 with pytorch-cuda 11.8


z3ugma commented Dec 23, 2023

I will try rolling back to peft 0.5 with cuda 12.2 and Python 3.11.

Will report back


z3ugma commented Dec 24, 2023

No, still a problem:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.1.2+cu121
Datasets 2.16.0
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0


z3ugma commented Dec 24, 2023

Unfortunately, it still shows all NaN after downgrading PyTorch and PEFT:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.36.2
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0


z3ugma commented Dec 24, 2023

@jeffliu-LL will you post your versions of Python, PyTorch, Transformers, and CUDA from the working environment?


z3ugma commented Dec 24, 2023

Here are the packages from the working Google Colab environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
PEFT 0.5.0


z3ugma commented Dec 24, 2023

Still not working on Python 3.10. Here are the version details from another nonworking environment:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
PEFT 0.5.0
SciPy 1.11.4
Pillow 9.4.0


bit-bcilab commented Jan 11, 2024

Seems like I'm hitting the same problem :(

OS: Windows 10
CUDA: 11.8
Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] on win32

Torch 2.1.2+cu118
Datasets 2.16.1
Transformers 4.36.2
PEFT 0.7.1
bitsandbytes 0.41.0


@wushandinghua

@z3ugma I'm hitting a similar issue. The loss changes to NaN after epoch 0. Have you fixed it?
dataset: jpawan33/kag100-image-captioning-dataset
pytorch:1.13.0
cuda:11.3
python: 3.9
PEFT: 0.7.2.dev0
transformers: 4.36.2


z3ugma commented Jan 21, 2024

@wushandinghua no, I've not yet had success

@pribadihcr

Any solutions? I have the same problem.


eddie221 commented Mar 6, 2024

I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.


pribadihcr commented Mar 6, 2024

> I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.

I changed this: `pixel_values = batch.pop("pixel_values").to(device, torch.float32)`

Still the same problem.

@shams2023

> [quotes the original issue in full]

Can you send me a copy of the code you used to run this locally? I also want to try it on my own computer instead of using Google Colab.
Thank you very much!

shams2023 commented Mar 13, 2024

Can you send me a copy of the code you ran locally? I also want to try it on my own computer instead of using Google Colab.
Thank you!

@eddie221

> I changed this: `pixel_values = batch.pop("pixel_values").to(device, torch.float32)`
>
> Still the same problem.

Sorry for the late reply.
I also changed the model dtype from torch.float16 to torch.float32, so two modifications to the code are needed:

1. `model = Blip2ForConditionalGeneration.from_pretrained("ybelkada/blip2-opt-2.7b-fp16-sharded", device_map="auto", load_in_8bit=True, torch_dtype=torch.float32)`
2. `pixel_values = batch.pop("pixel_values").to(device, torch.float32)`

Here is the notebook with my testing result:
https://colab.research.google.com/drive/1j2jey-OqmtUa3IcI1kOcswWAWmiG4JKJ?usp=sharing

@eddie221

> Can you send me a copy of the code you ran locally? I also want to try it on my own computer instead of using Google Colab. Thank you!

I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.

@shams2023

> I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.

Okay, I will deploy it in my own PyCharm and experiment.
