Description
Thank you for your code and engineering. I use the following code to fix the seed, so that a comparison experiment can be fully reproduced. (I have found that leaving the seed random, or fixing it only partially, can shift the evaluation accuracy by roughly ±2%, which is unacceptable in comparative experiments.)
I ran some tests and obtained the following results:
1. After a shutdown and restart, keeping the same parameters exactly reproduces the previous experiment.
2. On the same server with the same type of graphics card, single-card training and multi-card training with the same number of cards give identical results.
3. With the same seed and hyperparameters but a different graphics card, the final results differ.
4. Different hardware with the same graphics card also gives different results.
5. Resuming training after an interruption gives results different from a run that trained without interruption (I suspect this is related to the epoch counter and learning-rate schedule; sorry, I have not finished studying the relevant code; see the sketch after this list).
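Regarding point 5: one common reason a resumed run diverges is that the RNG states are not stored in the checkpoint, so augmentations and dropout follow a different random stream after resumption. A minimal sketch of saving and restoring all the RNG streams alongside the model (the file path and dictionary keys here are my own naming, not this repo's checkpoint format):

import random
import numpy as np
import torch

def save_rng_state(path):
    # Capture every RNG stream the training loop touches.
    torch.save({
        'python_rng': random.getstate(),
        'numpy_rng': np.random.get_state(),
        'torch_rng': torch.get_rng_state(),
        'cuda_rng': torch.cuda.get_rng_state_all(),
    }, path)

def load_rng_state(path):
    # Restore the streams before resuming training.
    state = torch.load(path)
    random.setstate(state['python_rng'])
    np.random.set_state(state['numpy_rng'])
    torch.set_rng_state(state['torch_rng'])
    torch.cuda.set_rng_state_all(state['cuda_rng'])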
I have tested multiple model families, including resnet, mobilenet, efficientnet, efficientformer, vit, levit, and xcit. However, I found that the efficientformerv2_s1 model is not fully deterministic; some other factor in the code prevents full reproducibility. When I tested on the same graphics card on the same server, a slight difference in results appeared during the first epoch; in addition, running multiple experiments on the same graphics card, a gap appeared in the second epoch. I am running experiments and reading other articles to find the cause, but I have not determined it yet. Could you please help me find it?
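One way I know of to locate such an op (I have not confirmed this on efficientformerv2_s1, so this is only a diagnostic suggestion) is to ask PyTorch to raise an error whenever a nondeterministic kernel would run; the error message then names the offending operation. A minimal sketch:

import os
import torch

# Some cuBLAS routines are only deterministic with this workspace
# setting; it must be set before CUDA is initialized.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

# Raise a RuntimeError naming the offending op whenever a
# nondeterministic algorithm would run (e.g. certain scatter/index
# backward kernels that are common in attention-style models).
torch.use_deterministic_algorithms(True)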
I modified random.py in utils with the following code:
import os
import random

import numpy as np
import torch

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
# The two cuDNN settings below make training slower but deterministic.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
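One thing this snippet does not cover (only a guess at a possible cause, not something I have confirmed for efficientformerv2_s1) is the DataLoader: with num_workers > 0, each worker process has its own RNG, so augmentations and shuffling can still vary between runs unless the workers and the shuffling generator are seeded too. A sketch following the PyTorch reproducibility recipe, where dataset, batch_size, and seed are placeholders:

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # torch.initial_seed() already differs per worker; derive the
    # numpy/python seeds from it so every worker is reproducible.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(seed)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # seeds numpy/python inside each worker
    generator=g,                 # fixes the shuffling order across runs
)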