Training reproducibility improvements #8213
Conversation
@AyushExel thanks for the PR! Can you please provide before and after results, i.e. 3 runs from master and 3 runs from PR? The scenario can be small, i.e. COCO128 YOLOv5s 30 epochs, but it's important to compare changes to the current baseline. Thanks! |
|
@AyushExel got it, perfect! I will check out this branch and run full YOLOv5s COCO trainings on the 8 GPUs today. |
@glenn-jocher nice. You also have the same test for master branch already right? |
@AyushExel yes, these are the differences between min and max mAP@0.5:0.95:
@AyushExel this seems to show identical variation to master at epoch 8 (about 0.4%), so there appears to be no change in randomness. What happened to the torch.use_deterministic_algorithms() that I suggested? |
@AyushExel also your dataloader init function seems to be lacking python and torch seed inits as in this example: https://discuss.pytorch.org/t/reproducibility-with-all-the-bells-and-whistles/81097 |
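For reference, the pattern from that thread looks roughly like this (a sketch, not the PR's code; `seed_worker` and the DataLoader arguments are illustrative):

```python
import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Each dataloader worker gets its own torch seed; fold it into the
    # python and numpy RNGs so augmentations are reproducible per worker.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)
# loader = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=4,
#                                      worker_init_fn=seed_worker, generator=g)
```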
@glenn-jocher The dataloader init_fn only requires an np random seed, as mentioned in the official pytorch issues here and here. torch.use_deterministic_algorithms() is not exception safe: many operations will simply throw a RuntimeError when it is enabled. Also, to work correctly on CUDA 10.2 and above, some environment variables need to be set or it will cause runtime exceptions. More details in the last section here: https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms |
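For context, the setup the docs describe is roughly the following (a sketch; on CUDA 10.2+ the environment variable must be set before any cuBLAS work, otherwise enabling deterministic mode raises runtime errors as noted above):

```python
import os
import torch

# Required on CUDA >= 10.2 for deterministic cuBLAS behavior; must be set
# before the first CUDA call that uses cuBLAS.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

# Ops without a deterministic implementation will raise RuntimeError once
# this flag is on, which is the "not exception safe" concern above.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```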
@glenn-jocher Also, after reading a lot of discussions on various platforms (GitHub, Kaggle, PyTorch discourse), I haven't found anyone who has actually been able to accurately reproduce their large-model experiments. All solutions are just there to reduce the variance. |
@AyushExel ok I'm going to cancel this training and try new experiments:
@glenn-jocher I'll do the 3rd on this branch in some time. |
@AyushExel ok got it! --workers 0 experiment started in new project, tracking results in same comment #8213 (comment). Each epoch there takes 30 min so we'll have the epoch 10 results in about 5 hours. The good news is the randomness at epoch 10 is a great benchmark, no need to wait 300 epochs. |
@glenn-jocher I was just testing torch.use_deterministic_algorithms() on my branch. The training ran successfully, but there's one operation post-training that throws a RuntimeError. I'll test more to see if it's actually deterministic. If so, we can change the implementation of the operation that throws the error. If not, let's leave it alone.
@glenn-jocher keep an eye on https://wandb.ai/cayush/use_deter?workspace=user-cayush |
@glenn-jocher Okay so I've set the EDIT: It seems like an additional seed in the dataloader init fn is not required, so I'm leaving it as it is right now. It only affects DDP mode, which I can't test locally |
@glenn-jocher You'll need to run these tests:
@AyushExel got it. Running 8 YOLOv5s now at https://wandb.ai/glenn-jocher/test-reproduce-pr2 |
@glenn-jocher great. How long does 1 epoch usually take? From the benchmark runs you posted above:
@glenn-jocher
@AyushExel yes I see this also. Losses are all identical but mAP is 0. Usually when mAP is 0 it's due to AMP/CUDA/Windows/Conda issues, but I've recently added AMP checks and these are passing for PR trainings in https://wandb.ai/glenn-jocher/test-reproduce-pr3. The AMP checks run inference on the pretrained model (or a downloaded YOLOv5n model if no pretrained model is provided) to verify that AMP inference and default inference produce similar results. This was added in #7917 and improved in #7937 |
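Roughly, that style of check looks like the following (a hedged sketch only; the actual check_amp implementation in #7917/#7937 may differ, and this assumes a model whose forward pass returns a single tensor):

```python
import torch

def amp_allclose(model, im, rtol=0.1):
    # Run the same image through the model with default (FP32) inference and
    # with AMP autocast, then require the outputs to roughly agree.
    model.eval()
    with torch.no_grad():
        out_fp32 = model(im).float()
        with torch.cuda.amp.autocast():
            out_amp = model(im).float()
    return torch.allclose(out_fp32, out_amp, rtol=rtol, atol=1e-2)
```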
@glenn-jocher No responses on the pytorch forum issue. I'm trying to debug this using pdb. Hopefully the bug is occurring during the calculation of mAPs with deterministic algorithms. I'll verify whether the model is actually learning anything or not by plotting results in each epoch. |
@glenn-jocher The error is in the mAP calculation. I plotted the BBoxDebugger for the 1st epoch of VOC training and most objects are detected correctly, so mAP shouldn't be 0. https://wandb.ai/cayush/use_deter_s/ |
@AyushExel I overlaid a master run against current PR: train losses, val losses, learning rates are all identical, but all metrics are zero. Very strange. Obviously the latest commit 254d379 caused this. Looking at the commit it has two changes, so let's try to isolate one change at a time to identify the cause. I'll comment out one line and retry a new training. |
I see what you're saying here. So this is good news, it means the models are actually learning and are identical, it's just the validation that seems problematic. But the validation is always deterministic anyway, it never varies, so maybe we can set flags to enable/disable deterministic mode in val.py as a quick fix. |
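The quick fix being floated could look something like this (purely an illustrative sketch; `run_validation` is a placeholder, not YOLOv5 code):

```python
import torch

# Hypothetical toggle: relax determinism only while validating, so any op in
# the mAP path that lacks a deterministic implementation stops raising.
torch.use_deterministic_algorithms(False)
# results = run_validation(model, val_loader)  # placeholder for the val.py call
torch.use_deterministic_algorithms(True)
```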
@UnglvKitDe thanks for the feedback! I've made updates to only run the command if torch 1.12 is installed; torch < 1.12 we'll leave alone. EDIT: Will run a new training today with these settings. |
@glenn-jocher @AyushExel I did 5 runs with coco128. In one of the 5 runs I get the 0 results again. A similar picture on my custom data (1 of 8 has the 0 problem again). Very strange. I have set up a clean conda installation with torch 1.12 and CUDA 11.6. |
@UnglvKitDe this is not the zero-mAP problem. Zero mAP means zero mAP at all times. In your training the validation losses are unstable and increasing, leading to logically low mAP. |
Tested PR in Colab with 1.12. Looks good, all 3 identical and high mAP.

```
!git clone https://github.com/AyushExel/yolov5 -b init_seeds  # clone
%cd yolov5
%pip install -qr requirements.txt torch==1.12 torchvision==0.13  # install

import torch
import utils
display = utils.notebook_init()  # checks

# Train YOLOv5s on COCO128 for 10 epochs
!python train.py --img 640 --batch 16 --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --seed 0
!python train.py --img 640 --batch 16 --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --seed 0
!python train.py --img 640 --batch 16 --epochs 10 --data coco128.yaml --weights yolov5s.pt --cache --seed 0
``` |
Testing vs master on COCO in https://wandb.ai/glenn-jocher/test-reproduce-pr4 EDIT: Unable to test on Multi-GPU systems per torch 1.12 YOLOv5 bug in #8395 |
@glenn-jocher Then we talked about other issues. When I used |
W&B trainings have been cancelled because unable to test on Multi-GPU systems per torch 1.12 YOLOv5 bug in #8395 |
@AyushExel @glenn-jocher With torch 1.12 there is an issue with multi-GPU training when I insert the reproducibility changes as described in the docs. For some reason, CUDA runs out of memory. On master it works without problems (~5.1/12 GB VRAM). |
Ok, so I tried debugging the above problem today, but I don't understand why it occurs. If I use a different number of workers it works. Unfortunately (as of now) I can't recreate it with any public dataset. Very strange. @glenn-jocher Have you ever seen such a problem? |
@UnglvKitDe it's not uncommon for gradient/training instabilities to lead to higher losses and diverged results. This is just a fact of life with nonlinear optimization problems. The reproducibility part is what we're trying to address with this PR, of course. |
@AyushExel I think this PR is good to merge. I added a |
@AyushExel PR is merged! The new deterministic policy is that init_seeds() defaults to deterministic=False, but we pass True in train.py: init_seeds(opt.seed + 1 + RANK, deterministic=True). I also added init_seeds to classifier.py and observed deterministic behavior without having to set deterministic=True, but I also tested it with True and saw no errors (strange). Anyway I think we are done here and can move on to other things! |
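For readers following along, an init_seeds() along the lines described above might look like this (a sketch reconstructing the described behavior, not a verbatim copy of the merged code):

```python
import os
import random
import numpy as np
import torch

def init_seeds(seed=0, deterministic=False):
    # Seed the python, numpy and torch RNGs (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if deterministic:
        # Path discussed in this thread for recent torch: deterministic
        # kernels plus the cuBLAS workspace setting required on CUDA 10.2+.
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.deterministic = True
        os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
        os.environ['PYTHONHASHSEED'] = str(seed)

# Called from train.py as described above:
# init_seeds(opt.seed + 1 + RANK, deterministic=True)
```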
* attempt at reproducibility
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci
* use deterministic algs
* fix everything :)
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci
* revert dataloader changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci
* process_batch as np
* remove newline
* Remove dataloader init fcn
* Update val.py
* Update train.py
* revert additional changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci
* Update train.py
* Add --seed arg
* Update general.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci
* Update train.py
* Update train.py
* Update val.py
* Update train.py
* Update general.py
* Update general.py
* Add deterministic argument to init_seeds()

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
good job |
@lijiajun3029 thank you! 🙏 This is a team effort and your valuable feedback and testing have been instrumental in improving YOLOv5. We're always here if you have more questions or need further assistance. |
Followed suggestions from:
pytorch/pytorch#7068 (comment)
https://www.mldawn.com/reproducibility-in-pytorch/
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Introduced a global training seed option for enhanced reproducibility.
📊 Key Changes
- `--seed` command-line argument to specify the global training seed.
- `init_seeds` method accepts a `deterministic` parameter and implements deterministic behavior when activated.
- `init_seeds` uses PyTorch's `use_deterministic_algorithms()` and sets the `CUBLAS_WORKSPACE_CONFIG` environment variable for PyTorch versions >= 1.12.0.

🎯 Purpose & Impact