error around wandb_logger when using custom trained weight file as initial weights #3583
👋 Hello @totti0223, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@totti0223 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem. @AyushExel is our W&B expert who may have some insight also.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
In addition to the above requirements, for Ultralytics to provide assistance your code should be:
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template, providing a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
@glenn-jocher @totti0223 Fix here - #3588
@totti0223 @AyushExel good news 😃! Your original issue may now be fixed ✅ in PR #3588. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
@totti0223 can you please pull the latest changes and see if the problem is solved?
Hmm, I tried again after git pull and the same error occurs.

command

error
I really don't know much about wandb, but I thought that even if the above is solved, using the same run_id as that of the pretrained weights would lead to cross-contamination of different training sessions in the visualizations, if it is a unique ID managed on the wandb server. So I tested an alternative that at least works in my environment: simply force-assign None to run_id. It's still dirty and I don't know whether this is the best solution, so I'm just leaving it here for discussion.

train.py
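A minimal sketch of the idea described above (not the exact diff; it assumes `weights`, `opt`, `save_dir` and `data_dict` are already in scope in train.py as in the repo version under discussion, and the WandbLogger import path is likewise assumed for that version):

```python
# Sketch of the workaround: only reuse the checkpoint's stored wandb run id when
# actually resuming; a fresh transfer-learning run should get a new wandb run.
# Assumes train.py-style names: weights, opt, save_dir, data_dict.
import os
import torch
from utils.wandb_logging.wandb_utils import WandbLogger  # import path assumed for this repo version

run_id = None
if weights.endswith('.pt') and os.path.isfile(weights):
    ckpt_wandb_id = torch.load(weights, map_location='cpu').get('wandb_id')
    run_id = ckpt_wandb_id if opt.resume else None  # None -> wandb creates a new run

wandb_logger = WandbLogger(opt, save_dir.stem, run_id, data_dict)
```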
@totti0223 okay, thanks for letting me know. I'm trying an alternative fix for it. I'll send you my branch to check if the problem is solved. I cannot reproduce it on my end. Appreciate your help.
@totti0223 hey I guess the issue is that the fix got merged into
Unfortunately, the same config error message appears:

raise config_util.ConfigError(
@totti0223 Okay thanks for doing this. I'm trying another fix for it. I'll share a branch with you to test shortly.
@totti0223 here's my develop branch https://github.com/AyushExel/yolov5/tree/develop
@AyushExel @totti0223 we were experimenting with a develop branch before but have now returned to master for all PRs. @AyushExel I saw your update on develop and have now merged it into master in f8adee1, so this might resolve the issue.
@glenn-jocher thanks. @totti0223 reported yesterday that this update didn't fix the problem for him. I've tried an explicit fix here. I'll make a PR once @totti0223 confirms if this fixed the problem.
@AyushExel ok understood. All new PRs should automatically merge to master in the future so the previous develop branch issue should no longer happen at least.
@AyushExel Hi, the errors are gone. However, as I expected, in the case unfavorable for transfer learning (not resuming training) it contaminates the previous wandb sessions. We want a new session to be created, but with the identical run_id it gets merged into the session in which the initial weights were created (see the green mark in the image below). This is fine for resuming training but not for independent training. I think we should additionally assign run_id as None in this case, as in the code I wrote above, or via further modification of your updates.
@totti0223 Okay thanks for testing the fix. @glenn-jocher I think we'll need to explicitly handle this use case.
@totti0223 here's my develop branch https://github.com/AyushExel/yolov5/tree/develop

P.S. I think it's not recommended to transfer learn from partially trained models. @glenn-jocher might be able to speak more on this.
@AyushExel Thanks for your reply.

Just my personal experience and, as you mentioned, it may not be for everyone, but it would be nice if possible.
@AyushExel Tested your latest branch and the wandb issues are resolved!
@totti0223 YOLOv5 automatically handles class count differences, so you can train any dataset from any pretrained model, regardless of differing class counts, i.e. train a 20-class VOC model from an 80-class COCO model. No action is required on your part. @AyushExel @totti0223 there are 3 main recommended workflows:
That said though, many users are under the false impression that the --resume command can be used to extend your training and keep getting better results on a fully trained model. We've prepared an automatic reply that generally describes these concepts below.
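On the automatic class-count handling mentioned above, a hedged illustration of the general mechanism (not a verbatim copy of YOLOv5's weight-loading code): only tensors whose names and shapes match are transferred, so detection-head layers sized for a different class count simply initialize fresh.

```python
# Illustration only: transfer whatever pretrained weights fit the new model.
# Assumes `model` is the new model built for the new class count, and the
# checkpoint path is an example.
import torch

ckpt = torch.load('coco_pretrained.pt', map_location='cpu')
pretrained_sd = ckpt['model'].float().state_dict()
model_sd = model.state_dict()

transferred = {k: v for k, v in pretrained_sd.items()
               if k in model_sd and v.shape == model_sd[k].shape}
model.load_state_dict(transferred, strict=False)
print(f'Transferred {len(transferred)}/{len(model_sd)} items from pretrained weights')
```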
@glenn-jocher just to confirm, a pre-trained model is a model that has been completely trained, right? Which means that a pre-trained model should never have this problem because
p.s. |
@AyushExel yes, by pretrained model I mean a model that successfully completed training and was processed by the strip_optimizer() function referenced here (Lines 618 to 632 in f8adee1).
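For readers without the repo open, roughly what that function does (a paraphrase of its behavior, not a verbatim copy of the linked lines; check the referenced commit for the authoritative version):

```python
# Paraphrased sketch of strip_optimizer(): finalize a training checkpoint by
# dropping training-only state so the file can be reused as --weights.
import torch

def strip_optimizer(f='best.pt', s=''):
    x = torch.load(f, map_location='cpu')
    if x.get('ema'):
        x['model'] = x['ema']  # keep the EMA weights as the final model
    for k in ('optimizer', 'training_results', 'wandb_id', 'ema', 'updates'):
        x[k] = None  # remove resume-only state, including the stored wandb run id
    x['epoch'] = -1  # mark training as finished (no longer --resume-able)
    x['model'].half()  # FP16 roughly halves the file size
    for p in x['model'].parameters():
        p.requires_grad = False
    torch.save(x, s or f)  # overwrite f, or save a copy to s if given
```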
@totti0223 yes, you can train any pretrained model on any new dataset; there are no constraints.
@glenn-jocher got it. So I think the current integration complains when users try to transfer-learn from partially trained, unstripped models, which is what is intended as a sanity check.
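As an aside, a hypothetical illustration of the kind of check being described (not the actual W&B integration code): an unstripped checkpoint still carries its original wandb run id, so it can be detected before transfer learning.

```python
# Hypothetical check, for illustration only; not the actual integration code.
import torch

ckpt = torch.load('runs/train/my_train/weights/best.pt', map_location='cpu')  # example path
if ckpt.get('wandb_id') is not None:
    # Unstripped: the checkpoint still holds the wandb run id (and optimizer state)
    # of its original training, so a new run started from it would resume and
    # contaminate that old wandb session.
    print('Checkpoint is not stripped; run strip_optimizer() before transfer learning.')
```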
Hi, trying to catch up with the discussion; I probably misunderstood "partially" vs. "fully" trained. I am not talking about resuming or using partially trained weights. One of my approaches is to start with COCO-pretrained weights to adapt to a custom DatasetA, and then later use those fully trained weights to train another DatasetB, where DatasetA and DatasetB share closer domain similarity to each other than to COCO, so starting from the DatasetA fully trained weights is generally favorable. (DatasetB may be a further enlarged version of DatasetA.) However, if many users of this repo do not face the case above and starting from the COCO-trained weights is almost always the choice, the code revision discussed does not need to be incorporated, as only a small fix is needed to cover such a need. Will re-close this issue 👍
@totti0223 your workflow details seem to be consistent with standard practice to me. 👍
@totti0223 a fully trained weight should not have
@AyushExel With your comment I've now finally got the point of partially vs. fully trained. This is because I always set the training epochs to a large few hundred and stop manually early by watching the model metrics for saturation (in this case, best.pt is fully trained in my context but partially trained in the yolov5 repo's meaning). I guess in this case I should manually apply the strip_optimizer() function to that weight and the problem is solved.
@totti0223 yes, you can call strip_optimizer() on any YOLOv5 .pt file at any point. Afterward it is not resumable with --resume, but you can start a new training on it with --weights.
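For example (the import path is assumed for this repo version, where the function lives in utils/general.py; the checkpoint path is just the one from this issue):

```python
from utils.general import strip_optimizer  # import path assumed for this repo version

# Finalize a manually early-stopped checkpoint so it behaves like a fully
# pretrained model: no optimizer state, no stored wandb run id.
strip_optimizer('runs/train/my_train/weights/best.pt')
```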
experienced in the latest repo version.
error info
runs fine
python train.py --img 640 --batch 8 --epochs 250 --data config.yaml --weights yolov5l --name my_train
encounters error prior to training
python train.py --img 640 --batch 8 --epochs 250 --data config.yaml --weights ./runs/train/my_train/weights/best.pt --name transfer_learning_train
temporary workaround
train.py
Note
During writing this issue, I noticed I have used the same config file name (config.yaml), although the content of the yaml differs between trains. I don't know whether that is the cause of the problem, so I will try again after my current train settles.