Differences in Model Performance When Reproducing Experiment #32

Open
fannie1208 opened this issue Oct 16, 2024 · 4 comments
fannie1208 commented Oct 16, 2024

Hi, thank you for your nice work!

I'm reproducing the results in Table 2, using the Mistral-7B model on MMLU and TydiQA with 5% of the data selected.

[Screenshot of Table 2 from the paper]

I followed the scripts in your repo to run the warmup, data selection, and training, and used the evaluation code in your repo to evaluate. I did not change any settings in your scripts, though I only ran a single random seed (3).

Despite following these settings, the performance of my model is worse than the results in Table 2.
For MMLU, Random reaches 58.3 (60.0 in your paper) and LESS reaches 60.8 (61.8 in your paper).
For TydiQA, the F1 of Random is 44.6 and of LESS is 55.1.

My environment is: torch 2.4.0, transformers 4.45.2, peft 0.13.1, datasets 3.0.1.
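In case it helps with comparing environments, here is a minimal sketch (plain Python, nothing repo-specific) that prints the installed versions so they can be checked against the ones above:

```python
# Minimal sketch: print the installed versions of the packages listed above
# so they can be compared against torch 2.4.0, transformers 4.45.2,
# peft 0.13.1, datasets 3.0.1.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["torch", "transformers", "peft", "datasets"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```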

Are these differences reasonable? Could you please confirm if the settings in your scripts are fully aligned with those used in your paper?

Thanks.

cafeii commented Oct 19, 2024

I'm not the author; I'm also reproducing this paper.

I'm quite confused about the default settings in their scripts. If you changed nothing, step 2 builds a gradient datastore with only 200 samples per dataset, but for a faithful reproduction you should use the full dataset (am I right?).

Would you mind sharing your results from steps 2, 3, and 4 (logging info, gradient datastore size, or anything else)? That would help pin down where the problem is.
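(Not from the repo, but if your gradient datastore ends up saved as a torch tensor file, a quick way to report its size is a sketch like the one below; the path and the num_samples × proj_dim layout are assumptions, not something the repo guarantees.)

```python
# Hedged sketch: inspect a saved gradient datastore, assuming it is a torch
# tensor of shape (num_samples, projection_dim). The path is a placeholder.
import torch

grads = torch.load("path/to/your/grads.pt", map_location="cpu")
print(type(grads))
print(getattr(grads, "shape", None))  # expected something like (num_samples, proj_dim)
```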

fannie1208 (Author) commented

Hi, I used their script for calculating gradients but deleted `--max_samples 200`, so it builds the gradient datastore over all data samples.

As for the batch size, I used the default setting in their scripts and got checkpoints 422, 845, 1268, and 1688.

cafeii commented Oct 20, 2024

I guess you probably ran the script on a single GPU.

The authors report a batch size of 128, so one epoch over the 4 training datasets should take about 105 steps, which matches the CKPT=105 in the step 2 tutorial. Given the default settings in the scripts (per_device_train_batch_size=1 and gradient_accumulation_steps=32), they probably ran these experiments on 4 GPUs. If you run on a single GPU, the default effective batch size is only 32, which may make the results worse.
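A back-of-the-envelope sketch of that arithmetic (assuming effective batch size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus, and inferring the warmup set size from the 422 steps per epoch reported above; none of these numbers come from the repo itself):

```python
# Rough sketch: relate the reported checkpoint counts to the effective batch size.
# Assumes effective batch = per_device_train_batch_size * grad_accum_steps * num_gpus.
import math

per_device_bs = 1
grad_accum = 32

# Single GPU: effective batch 32; 422 steps per epoch were observed above,
# which implies roughly 422 * 32 ~= 13.5k warmup examples (an inference).
n_examples = 422 * (per_device_bs * grad_accum * 1)

# Four GPUs: effective batch 128, as reported in the paper.
steps_4gpu = math.ceil(n_examples / (per_device_bs * grad_accum * 4))
print(steps_4gpu)  # ~106, in line with the CKPT=105 from the step 2 tutorial
```

Either way the step counts are consistent; the only real difference is the effective batch size (32 vs. 128).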

fannie1208 (Author) commented

Yes, I ran it on a single GPU, but I don't think the batch size alone should affect the results that much.
