
Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb loss is all NaN #454

Open
z3ugma opened this issue Dec 4, 2023 · 25 comments

z3ugma commented Dec 4, 2023

This notebook: peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb

Trains fine on Google Colab at https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=upI97XEH6EKe

using Python 3.10.12, Torch 2.1.0

It does not train on my workstation - the loss collapses to NaN after just a few epochs:

Loss: 6.078125
['a soccer player with his arms up in the air\n', 'mario balotelli celebrates after scoring against juventus\n']
Loss: 3.630859375
Loss: 4.01171875
Epoch: 2
Loss: 4.48046875
['cristiano ronaldo is the most expensive player in the world\n', 'a soccer player with his arms raised in celebration\n']
Loss: 3.25
Loss: 4.2734375
Epoch: 3
Loss: 4.0625
['a bald soccer player with a white shirt and blue shorts\n', "Juventus' Mario Mandzukic celebrates after scoring against Barcelona\n"]
Loss: 3.01953125
Loss: nan
Epoch: 4
Loss: nan

My workstation uses nearly the same versions: Python 3.10.13 and Torch 2.1.0. What could be causing the loss to be all NaN?
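To pin down exactly where a run diverges, one option is to record each step's loss and scan for the first non-finite value. A minimal sketch (the `first_nonfinite` helper is illustrative, not part of the notebook):

```python
import math

def first_nonfinite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None

# Loss values from the log above; the first NaN appears at index 8.
logged = [6.078125, 3.630859375, 4.01171875, 4.48046875,
          3.25, 4.2734375, 4.0625, 3.01953125, float("nan")]
print(first_nonfinite(logged))  # → 8
```

Knowing the exact step the loss first goes non-finite helps tell an fp16 overflow (loss blows up over a few steps) apart from a single bad batch.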


z3ugma commented Dec 4, 2023

@younesbelkada you're the author of that sample notebook and the keeper of the football dataset on Hugging Face - any idea what might be causing the loss to go to nan?

@triangle959

@z3ugma Same problem here. I start getting NaN loss in the 2nd batch of epoch 0. Have you solved it?

@triangle959

I found something interesting: on Google Colab, the loss does not become NaN. There must still be some difference between Colab and the local notebook.

@jeffliu-LL

I also ran into this issue recently when fine-tuning BLIP2, whereas it was working before. I haven't had a chance to pin it down, but it might be a package introducing a breaking change?

@jeffliu-LL

Rolling back to peft==0.5.0 got the BLIP2 example working for me.
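For anyone trying the same rollback, this is roughly the pin set implied by this comment together with the working Colab versions reported later in the thread. Treat it as a starting point, not a guaranteed fix:

```
torch==2.0.1
transformers==4.35.2
datasets==2.16.0
peft==0.5.0
```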

@AntoniaSch

@jeffliu-LL which pytorch version are you using?

@jeffliu-LL

pytorch 2.0.1 with pytorch-cuda 11.8


z3ugma commented Dec 23, 2023

I will try rolling back to peft 0.5 with cuda 12.2 and Python 3.11.

Will report back


z3ugma commented Dec 24, 2023

No, still a problem:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.1.2+cu121
Datasets 2.16.0
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0


z3ugma commented Dec 24, 2023

Unfortunately, it still shows all NaN after downgrading PyTorch and PEFT:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.36.2
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0


z3ugma commented Dec 24, 2023

@jeffliu-LL will you post your versions of Python, PyTorch, Transformers, and CUDA from the working environment?


z3ugma commented Dec 24, 2023

Here are the packages from the working Google Colab environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
PEFT 0.5.0


z3ugma commented Dec 24, 2023

Still not working on Python 3.10. Here are the version details from another nonworking environment:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
PEFT 0.5.0
SciPy 1.11.4
Pillow 9.4.0


bit-bcilab commented Jan 11, 2024

Seems like I'm hitting the same problem :(

OS: Windows 10
CUDA: 11.8
Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] on win32

Torch 2.1.2+cu118
Datasets 2.16.1
Transformers 4.36.2
PEFT 0.7.1
bitsandbytes 0.41.0


@wushandinghua

@z3ugma I'm hitting a similar issue. The loss changes to NaN after epoch 0. Have you fixed it?
dataset: jpawan33/kag100-image-captioning-dataset
pytorch:1.13.0
cuda:11.3
python: 3.9
PEFT: 0.7.2.dev0
transformers: 4.36.2


z3ugma commented Jan 21, 2024

@wushandinghua no, I've not yet had success

@pribadihcr

Any solutions? I have the same problem.


eddie221 commented Mar 6, 2024

I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.


pribadihcr commented Mar 6, 2024

> I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.

I changed this: `pixel_values = batch.pop("pixel_values").to(device, torch.float32)`

Still the same problem.

@shams2023

> [quotes the original issue in full]

Can you send me a copy of the code you used to run this locally? I also want to try it on my own computer instead of using Google Colab.
Thank you very much!

shams2023 commented Mar 13, 2024

Can you send me a copy of the code you ran locally? I also want to try it on my own computer instead of using Google Colab.
Thank you!

@eddie221

> I changed this: `pixel_values = batch.pop("pixel_values").to(device, torch.float32)`
>
> Still the same problem.

Sorry for the late reply.
I also changed the model dtype from torch.float16 to torch.float32, so two modifications to the code are needed:

1. `model = Blip2ForConditionalGeneration.from_pretrained("ybelkada/blip2-opt-2.7b-fp16-sharded", device_map="auto", load_in_8bit=True, torch_dtype=torch.float32)`
2. `pixel_values = batch.pop("pixel_values").to(device, torch.float32)`

Here is the notebook with my testing result:
https://colab.research.google.com/drive/1j2jey-OqmtUa3IcI1kOcswWAWmiG4JKJ?usp=sharing

@eddie221

> Can you send me a copy of the code you ran locally? I also want to try it on my own computer instead of using Google Colab. Thank you!

I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.

@shams2023

> I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.

Okay, I will deploy it in my own PyCharm and experiment.
