
Using preprocess_phi_3_new in LAVIS/open_flamingo/train/sft_data_utils.py gets labels all -100. #776

Open
JHW5981 opened this issue Dec 26, 2024 · 4 comments

Comments


JHW5981 commented Dec 26, 2024

Hello, thank you for your wonderful work.

I ran into a problem while re-implementing LazySupervisedDataset and am stuck at retrieving the training labels: every label comes out as -100.

[screenshot: the generated labels tensor, all -100]

Below is a screenshot of my dataset:

[screenshot: dataset sample]

I reuse your LazySupervisedDataset as-is. Initializing data_path, tokenizer, image_processor, and args works without any issue, but when I inspect the labels it generates, the tensor is entirely -100.

I debugged this strange behavior and found that the issue occurs because of the following piece of code:

[screenshot: the code in question]

First, by the time the if-clause above reaches the "user round," cur_len never equals total_len, so the line target[:] = IGNORE_INDEX is always executed.

[screenshot: debugger output]
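The fallback described above can be sketched in a few lines (a simplified, hypothetical reconstruction — the function name, the round-length pairs, and the concrete numbers are made up; the real code derives them by tokenizing each conversation round):

```python
IGNORE_INDEX = -100

def mask_labels(target, round_lens, total_len):
    """Mask the user/instruction part of each round; if the accumulated
    length disagrees with the tokenized total, mask the whole sample.
    round_lens: (instruction_len, round_len) pairs, one per round."""
    cur_len = 0
    for instruction_len, round_len in round_lens:
        # mask the instruction tokens of this round
        target[cur_len:cur_len + instruction_len] = [IGNORE_INDEX] * instruction_len
        cur_len += round_len
    if cur_len != total_len:
        # safety fallback: any length mismatch masks the entire sample
        target[:] = [IGNORE_INDEX] * len(target)
    return target

# Consistent lengths: the assistant tokens keep their labels.
ok = mask_labels(list(range(10)), [(4, 10)], total_len=10)
# An off-by-one mismatch: every label becomes -100, as observed above.
bad = mask_labels(list(range(10)), [(4, 10)], total_len=11)
```

This is why a single off-by-one in the per-round token counting is enough to silently zero out the supervision signal for every sample.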

Second, the code at line 226 does not skip the bos token but instead skips the "<|user|>" token. I don’t understand the reasoning behind this behavior.

[screenshot: code around line 226]
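A minimal way to see why skipping the wrong leading token produces the mismatch (the token counts below are hypothetical, not actual tokenizer output):

```python
def lengths_consistent(total_len, round_lens, leading_skip):
    """True when the skipped leading tokens plus the per-round token
    counts add up to the tokenized total; the dataset masks the whole
    sample whenever this check fails."""
    return leading_skip + sum(round_lens) == total_len

# hypothetical counts: 1 bos token followed by one 11-token round
assert lengths_consistent(12, [11], leading_skip=1)       # skipping bos: consistent
# if the code instead skips "<|user|>" (already counted inside the round),
# the bos token is never accounted for and the totals drift apart
assert not lengths_consistent(12, [11], leading_skip=0)
```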
JHW5981 (Author) commented Dec 26, 2024

@azshue

JHW5981 (Author) commented Dec 27, 2024

Not sure if it was an oversight by the author, but I believe these two lines of code should be commented out.
[screenshot: the two lines in question]

azshue (Collaborator) commented Dec 27, 2024

Hi @JHW5981,

Thank you for trying out our code.

In my local environment, the current code works with the latest phi3-mini model/tokenizer. The code on lines 225-226 worked with a previous version of the phi3 tokenizer (the aforementioned phi3 model update).

Could you check if your local phi3 model is up-to-date?

JHW5981 (Author) commented Dec 27, 2024

@azshue Thank you for your response.

Replacing the tokenizer with the newest Phi-3 tokenizer does not solve my problem.

I downloaded the weights from Salesforce/xgen-mm-phi3-mini-instruct-r-v1 and used the following code to load the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my local weight dir")
```

To update the tokenizer, I replaced the files special_tokens_map.json, tokenizer_config.json, tokenizer.json, and tokenizer.model with those from microsoft/Phi-3-mini-4k-instruct. After reloading the tokenizer, the issue persists.

I wonder whether lines 225-226 in the code are essential. On line 221, the <|assistant|> token is already added; if lines 225-226 are not commented out, the <|assistant|> token is not masked. I believe this behavior is inconsistent with how user-defined tokens should normally be masked.

[screenshot]
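For reference, a common convention in SFT label construction is to mask the role marker itself and supervise only the response tokens that follow it. A self-contained sketch of that convention, with made-up token ids rather than the actual phi3 vocabulary:

```python
IGNORE_INDEX = -100

def build_labels(input_ids, assistant_id, eos_id):
    """Supervise only the tokens after each <|assistant|> marker, up to
    and including eos; the marker itself and all user/system tokens get
    IGNORE_INDEX."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    for i, tok in enumerate(input_ids):
        if tok == assistant_id:
            in_response = True   # start supervising AFTER this token
            continue
        if in_response:
            labels[i] = tok
            if tok == eos_id:
                in_response = False
    return labels

# hypothetical ids: 1 = <|user|>, 2 = <|assistant|>, 9 = eos
ids = [1, 11, 12, 2, 21, 22, 9]
labels = build_labels(ids, assistant_id=2, eos_id=9)
# → [-100, -100, -100, -100, 21, 22, 9]
```

Under this convention the <|assistant|> marker carries IGNORE_INDEX, which is the behavior I would expect from lines 225-226.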

By the way, do you know why using the tokenizer to convert text to ids does not add special tokens, even when I explicitly set add_special_tokens=True?😂
