
Using preprocess_phi_3_new in LAVIS/open_flamingo/train/sft_data_utils.py gets labels all -100. #776

Open
JHW5981 opened this issue Dec 26, 2024 · 4 comments

Comments


JHW5981 commented Dec 26, 2024

Hello, thank you for your wonderful work.

I ran into a problem while re-implementing LazySupervisedDataset and am stuck at retrieving the training labels: every label comes out as -100.

[screenshot: the generated labels tensor, all -100]

Below is a screenshot of my dataset:

[screenshot: dataset sample]

I reuse your LazySupervisedDataset as-is. Initializing data_path, tokenizer, image_processor, and args works without any issue, but when I inspect the labels it generates, the tensor is entirely -100.

I debugged this strange behavior and found that the issue occurs because of the following piece of code:

[screenshot: the code in question]

First, by the time the if-clause above reaches the "user round," cur_len never equals total_len, so the line target[:] = IGNORE_INDEX is always executed.

[screenshot: debugger output]
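The fallback described above can be sketched in a few lines (a simplified, hypothetical reconstruction — the function name, the round-length pairs, and the concrete numbers are made up; the real code derives them by tokenizing each conversation round):

```python
IGNORE_INDEX = -100

def mask_labels(target, round_lens, total_len):
    """Mask the user/instruction part of each round; if the accumulated
    length disagrees with the tokenized total, mask the whole sample.
    round_lens: (instruction_len, round_len) pairs, one per round."""
    cur_len = 0
    for instruction_len, round_len in round_lens:
        # mask the instruction tokens of this round
        target[cur_len:cur_len + instruction_len] = [IGNORE_INDEX] * instruction_len
        cur_len += round_len
    if cur_len != total_len:
        # safety fallback: any length mismatch masks the entire sample
        target[:] = [IGNORE_INDEX] * len(target)
    return target

# Consistent lengths: the assistant tokens keep their labels.
ok = mask_labels(list(range(10)), [(4, 10)], total_len=10)
# An off-by-one mismatch: every label becomes -100, as observed above.
bad = mask_labels(list(range(10)), [(4, 10)], total_len=11)
```

This is why a single off-by-one in the per-round token counting is enough to silently zero out the supervision signal for every sample.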

Second, the code at line 226 does not skip the bos token but instead skips the "<|user|>" token. I don’t understand the reasoning behind this behavior.

[screenshot: code around line 226]
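A minimal way to see why skipping the wrong leading token produces the mismatch (the token counts below are hypothetical, not actual tokenizer output):

```python
def lengths_consistent(total_len, round_lens, leading_skip):
    """True when the skipped leading tokens plus the per-round token
    counts add up to the tokenized total; the dataset masks the whole
    sample whenever this check fails."""
    return leading_skip + sum(round_lens) == total_len

# hypothetical counts: 1 bos token followed by one 11-token round
assert lengths_consistent(12, [11], leading_skip=1)       # skipping bos: consistent
# if the code instead skips "<|user|>" (already counted inside the round),
# the bos token is never accounted for and the totals drift apart
assert not lengths_consistent(12, [11], leading_skip=0)
```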
JHW5981 (Author) commented Dec 26, 2024

@azshue

JHW5981 (Author) commented Dec 27, 2024

Not sure if it was an oversight by the author, but I believe these two lines of code should be commented out.
[screenshot: the two lines in question]

azshue (Collaborator) commented Dec 27, 2024

Hi @JHW5981,

Thank you for trying out our code.

In my local environment, the current code works with the latest phi3-mini model/tokenizer. The code on lines 225-226 worked with a previous version of the phi3 tokenizer (the aforementioned phi3 model update).

Could you check if your local phi3 model is up-to-date?

JHW5981 (Author) commented Dec 27, 2024

@azshue Thank you for your response.

Replacing the tokenizer with the newest Phi-3 tokenizer does not solve my problem.

I downloaded the weights from Salesforce/xgen-mm-phi3-mini-instruct-r-v1 and used the following code to load the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my local weight dir")
```

To update the tokenizer, I replaced the files special_tokens_map.json, tokenizer_config.json, tokenizer.json, and tokenizer.model with those from microsoft/Phi-3-mini-4k-instruct. After reloading the tokenizer, the issue persists.

I wonder whether lines 225-226 in the code are essential. On line 221, the <|assistant|> token is already added; if lines 225-226 are not commented out, the <|assistant|> token is not masked. I believe this behavior is inconsistent with how user-defined tokens should normally be masked.

[screenshot]
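For reference, a common convention in SFT label construction is to mask the role marker itself and supervise only the response tokens that follow it. A self-contained sketch of that convention, with made-up token ids rather than the actual phi3 vocabulary:

```python
IGNORE_INDEX = -100

def build_labels(input_ids, assistant_id, eos_id):
    """Supervise only the tokens after each <|assistant|> marker, up to
    and including eos; the marker itself and all user/system tokens get
    IGNORE_INDEX."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    for i, tok in enumerate(input_ids):
        if tok == assistant_id:
            in_response = True   # start supervising AFTER this token
            continue
        if in_response:
            labels[i] = tok
            if tok == eos_id:
                in_response = False
    return labels

# hypothetical ids: 1 = <|user|>, 2 = <|assistant|>, 9 = eos
ids = [1, 11, 12, 2, 21, 22, 9]
labels = build_labels(ids, assistant_id=2, eos_id=9)
# → [-100, -100, -100, -100, 21, 22, 9]
```

Under this convention the <|assistant|> marker carries IGNORE_INDEX, which is the behavior I would expect from lines 225-226.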

By the way, do you know why using the tokenizer to convert text to ids does not add special tokens, even when I explicitly set add_special_tokens=True?😂
