
error around wandb_logger when using custom trained weight file as initial weights #3583

Closed
totti0223 opened this issue Jun 11, 2021 · 29 comments · Fixed by #3588
Labels
bug Something isn't working

Comments

@totti0223

Experienced in the latest version of the repo.

error info

  • runs fine
    python train.py --img 640 --batch 8 --epochs 250 --data config.yaml --weights yolov5l --name my_train

  • encounters an error prior to training
    python train.py --img 640 --batch 8 --epochs 250 --data config.yaml --weights ./runs/train/my_train/weights/best.pt --name transfer_learning_train

File "train.py", line 543, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 72, in train
    wandb_logger = WandbLogger(opt, save_dir.stem, run_id, data_dict)
  File "/home2/synthetic_seeds/yolov5/utils/wandb_logging/wandb_utils.py", line 121, in __init__
    self.wandb_run.config.opt = vars(opt)
  File "/home/dl-box/anaconda3/envs/yolo5/lib/python3.8/site-packages/wandb/sdk/wandb_config.py", line 139, in __setitem__
    key, val = self._sanitize(key, val)
  File "/home/dl-box/anaconda3/envs/yolo5/lib/python3.8/site-packages/wandb/sdk/wandb_config.py", line 231, in _sanitize
    raise config_util.ConfigError(
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of key "opt" from {'cfg': '', 'hyp': {'box': 0.05, 'cls': 0.5, 'lr0': 0.01, 'lrf': 0.2, 'obj': 1, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'iou_t': 0.2, 'mixup': 0, 'scale': 0.5, 'shear': 0, 'cls_pw': 1, 'fliplr': 0.5, 'flipud': 0, 'mosaic': 1, 'obj_pw': 1, 'degrees': 0, 'anchor_t': 4, 'fl_gamma': 0, 'momentum': 0.937, 'translate': 0.1, 'perspective': 0, 'weight_decay': 0.0005, 'warmup_epochs': 3, 'warmup_bias_lr': 0.1, 'warmup_momentum': 0.8}, 'adam': False, 'data': 'config.yaml', 'name': 'my_train', 'quad': False, 'rect': False, 'bucket': '', 'device': '', 'entity': None, 'epochs': 250, 'evolve': False, 'nosave': False, 'notest': False, 'resume': False, 'project': 'runs/train', 'sync_bn': False, 'weights': 'yolov5l.pt', 'workers': 8, 'exist_ok': False, 'img_size': [640, 640], 'save_dir': 'runs/train/my_train', 'linear_lr': False, 'batch_size': 8, 'local_rank': -1, 'single_cls': False, 'world_size': 1, 'global_rank': -1, 'multi_scale': False, 'save_period': -1, 'cache_images': False, 'noautoanchor': False, 'bbox_interval': -1, 'image_weights': False, 'artifact_alias': 'latest', 'upload_dataset': False, 'label_smoothing': 0, 'total_batch_size': 8} to {'weights': './runs/train/my_train/weights/best.pt', 'cfg': '', 'data': 'config.yaml', 'hyp': {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}, 'epochs': 250, 'batch_size': 8, 'img_size': [640, 640], 'rect': False, 'resume': False, 'nosave': False, 'notest': False, 'noautoanchor': False, 'evolve': False, 'bucket': '', 'cache_images': False, 'image_weights': False, 'device': '', 'multi_scale': False, 'single_cls': False, 'adam': False, 'sync_bn': False, 'local_rank': -1, 'workers': 8, 'project': 'runs/train', 'entity': None, 'name': 'transfer_learning_train', 'exist_ok': False, 'quad': False, 'linear_lr': False, 'label_smoothing': 0.0, 'upload_dataset': False, 'bbox_interval': -1, 'save_period': -1, 'artifact_alias': 'latest', 'world_size': 1, 'global_rank': -1, 'save_dir': 'runs/train/transfer_learning_train', 'total_batch_size': 8}
If you really want to do this, pass allow_val_change=True to config.update()
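
For reference, a minimal sketch of the failure mode (project name and run id below are placeholders, not from this repo): the wandb_id stored in best.pt makes the second training resume the first W&B run, and writing a different opt value to the resumed config raises the ConfigError above.

import wandb

# first training stores its options under the "opt" config key
run = wandb.init(project="yolov5-demo", id="abc123")
run.config.opt = {"weights": "yolov5l.pt", "name": "my_train"}
run.finish()

# a later training that reuses the same id resumes that config;
# assigning a different value to the existing key raises ConfigError
run = wandb.init(project="yolov5-demo", id="abc123", resume="allow")
run.config.opt = {"weights": "./runs/train/my_train/weights/best.pt"}  # ConfigError

# the workaround the error message suggests:
run.config.update({"opt": {"weights": "./runs/train/my_train/weights/best.pt"}},
                  allow_val_change=True)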

temporary workaround

train.py

#run_id = torch.load(weights).get('wandb_id') if weights.endswith('.pt') and os.path.isfile(weights) else None
run_id = None

Note

While writing this issue, I noticed I used the same config file name (config.yaml) for both runs, although the content of the yaml differs between trainings. I don't know whether that is the cause of the problem, so I will check after my current training settles.

@totti0223 totti0223 added the bug Something isn't working label Jun 11, 2021
@github-actions
Contributor

github-actions bot commented Jun 11, 2021

👋 Hello @totti0223, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

glenn-jocher commented Jun 11, 2021

@totti0223 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

@AyushExel is our W&B expert who may have some insight also.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@AyushExel
Contributor

@glenn-jocher @totti0223 Fix here - #3588

@glenn-jocher glenn-jocher linked a pull request Jun 11, 2021 that will close this issue
@glenn-jocher
Member

glenn-jocher commented Jun 11, 2021

@totti0223 @AyushExel good news 😃! Your original issue may now be fixed ✅ in PR #3588. To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks Open In Colab Open In Kaggle
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@AyushExel
Contributor

@totti0223 can you please pull the latest changes and see if the problem is solved?

@totti0223
Author

@AyushExel

Hmm, I tried again after git pull and the same error occurs.

command
python train.py --img 1280 --batch 16 --epochs 250 --data config.yaml --weights ./runs/train/210613_1280_small2/weights/best.pt --name test

error

wandb: ERROR Attempted to change value of key "opt" from {'cfg': '', 'hyp': {'box':......
wandb: ERROR If you really want to do this, pass allow_val_change=True to config.update()
....
Traceback (most recent call last):
  File "train.py", line 543, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 72, in train
    wandb_logger = WandbLogger(opt, save_dir.stem, run_id, data_dict)
  File "/home2/synthetic_seeds/yolov5/utils/wandb_logging/wandb_utils.py", line 129, in __init__
    self.wandb_run.config.opt = vars(opt)
  File "/home/dl-box/anaconda3/envs/yolo5/lib/python3.8/site-packages/wandb/sdk/wandb_config.py", line 139, in __setitem__
    key, val = self._sanitize(key, val)
  File "/home/dl-box/anaconda3/envs/yolo5/lib/python3.8/site-packages/wandb/sdk/wandb_config.py", line 231, in _sanitize
    raise config_util.ConfigError(
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of key "opt" from {'cfg': '', 'hyp': {'box': 0.05, 'cls': 0.5, 'lr0': 0.01, 'lrf': 0.2, 'obj': 1, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'iou_t': 0.2, 'mixup': 0,

I really don't know much about wandb, but I thought that even if the above is solved, using the same run_id as that of the pretrained weights would lead to cross-contamination of different training sessions in the visualizations, IF it is a unique ID managed on the wandb server.

So I've just tested an alternative that at least works in my environment, by forcibly assigning None to run_id. It's still dirty and I don't know whether this is the best solution, so I'm just leaving it here for discussion.

train.py
line 72

assets = ['yolov5s.pt', 'yolov5m.pt', 'yolov5l.pt', 'yolov5x.pt',
          'yolov5s6.pt', 'yolov5m6.pt', 'yolov5l6.pt', 'yolov5x6.pt']
if weights in assets:
    # the torch.load(weights).get('wandb_id') against pretrained weights in repo returns None anyway.
    run_id = None
elif weights.endswith('.pt') and os.path.isfile(weights) and opt.resume:
    # resuming training, using the same run_id
    run_id = torch.load(weights).get('wandb_id')
else:
    # transfer learning using homebrew pt file.
    run_id = None

@totti0223 totti0223 reopened this Jun 13, 2021
@AyushExel
Contributor

@totti0223 Okay, thanks for letting me know. I'm trying an alternative fix for it. I'll send you my branch to check if the problem is solved. I cannot reproduce it on my end. Appreciate your help.

@AyushExel
Contributor

@totti0223 Hey, I guess the issue is that the fix got merged into the develop branch and not master.
Can you check using the develop branch? Here are the steps.

  • git clone https://github.com/ultralytics/yolov5.git --branch develop
  • Copy the weights from your current copy of the repo to this clone, i.e. this file -> /runs/train/210613_1280_small2/weights/best.pt
  • And then run the same command. Let me know if you still see the error.

@totti0223
Author

Unfortunately, the same config error message appears.

git branch --contains

  • develop

python train.py --img 1280 --batch 16 --epochs 250 --data config.yaml --weights ../../yolov5/runs/train/210613_1280_small2/weights/best.pt --name test

raise config_util.ConfigError(
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of ke................
wandb: Waiting for W&B process to finish, PID 24157
wandb: Program failed with code 1. Press ctrl-c to abort syncing.

@AyushExel
Contributor

@totti0223 Okay thanks for doing this. I'm trying another fix for it. I'll share a branch with you to test shortly

@AyushExel
Contributor

@totti0223 here's my develop branch https://github.com/AyushExel/yolov5/tree/develop
I've added explicit config.update(...,allow_val_change=True) and tested it on my machine. Resuming from checkpoint works fine. But I didn't face this problem before. So, it'd be great if you can confirm the problem is fixed :)
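
For reference, a minimal sketch of the kind of change described above (the exact lines in utils/wandb_logging/wandb_utils.py may differ on that branch; this is an assumption, not the merged diff):

# inside WandbLogger.__init__, instead of the direct assignment
#   self.wandb_run.config.opt = vars(opt)
# which raises ConfigError on a resumed run, update the config explicitly:
self.wandb_run.config.update({'opt': vars(opt)}, allow_val_change=True)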

@glenn-jocher
Member

glenn-jocher commented Jun 14, 2021

@AyushExel @totti0223 we were experimenting with a develop branch before but have now returned to master for all PRs.

@AyushExel I saw your update on develop and have now merged it into master in f8adee1, so this might resolve the issue.

@AyushExel
Contributor

@glenn-jocher Thanks. @totti0223 reported yesterday that this update didn't fix the problem for him. I've tried an explicit fix here. I'll make a PR once @totti0223 confirms whether this fixed the problem.

@glenn-jocher
Member

@AyushExel ok understood. All new PRs should automatically merge to master in the future so the previous develop branch issue should no longer happen at least.

@totti0223
Author

@AyushExel Hi, the errors are gone. However, as I expected, in the case unfavorable for transfer learning (not resuming training) it contaminates the previous wandb sessions: we want a new session to be created, but with the identical run_id it gets merged into the session in which the initial weights were created (see the green mark in the image below). This is fine for resuming training but not for an independent training. I think we should additionally assign None to run_id in this case, as in the code I wrote above, or with a further modification to your update.

image

@AyushExel
Contributor

@totti0223 Okay, thanks for testing the fix. @glenn-jocher I think we'll need to explicitly handle this use case.
Current behaviour - Check for wandb_id and resume a run if it exists
Expected behaviour - Check for wandb_id and resume a run if it exists and the --resume flag is used
This might fix the problem with transfer learning from custom trained models.
Also, I think most users don't face this problem with transfer learning because they use the official stripped YOLOv5 models, which don't have a wandb_id. Which makes me wonder, do you want to allow transfer learning from models which are not completely trained, i.e., models that are not stripped?

@AyushExel
Contributor

AyushExel commented Jun 14, 2021

@totti0223 here's my develop branch https://github.com/AyushExel/yolov5/tree/develop
I've added another update. The run_id will only be used if the --resume flag is set; it will be ignored otherwise. Does this solution work for your use case? Is there anything that you think can be improved? Please test this whenever you're free. I appreciate your active involvement in this process :)
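
A minimal sketch of the gating behaviour described above (assuming torch, os, opt and weights are already in scope as in train.py at the time; not the exact diff):

# only reuse the wandb_id stored in the checkpoint when --resume is set
run_id = torch.load(weights).get('wandb_id') if opt.resume and weights.endswith('.pt') and os.path.isfile(weights) else None
wandb_logger = WandbLogger(opt, save_dir.stem, run_id, data_dict)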

P.S. I think it's not recommended to transfer learn from partially trained models. @glenn-jocher might be able to speak more on this.

@totti0223
Author

@AyushExel Thanks for your reply.

  1. In my case the dataset categories gradually increase iteratively over time, so I would like to use the most recently trained weights for transfer learning (more accurately, fine-tuning in this case) instead of starting from the provided weights again. Or, more generally, changing the number of output labels with the same dataset for a different usage. I don't know about YOLOv5 specifically, but I've experienced that starting from my own weights is faster in other object detection training.
  2. I'd think anyone would like to use their own fully trained weights from other datasets as a transfer-learning starting point? For domain-specialized tasks (like biological images), starting from custom weights is favorable.

Just my personal experience and, as you mentioned, it may not be for everyone, but it would be nice if possible.

@totti0223
Author

@AyushExel Tested your latest branch and the wandb issues are resolved!
Now we can choose from the official provided weights or custom weights, in addition to resuming training.
:)

image

@glenn-jocher
Member

glenn-jocher commented Jun 14, 2021

@totti0223 YOLOv5 automatically handles class count differences, so you can train any dataset from any pretrained model, regardless of differing class counts, i.e. train a 20-class VOC model from an 80-class COCO model. No action is required on your part.

@AyushExel @totti0223 there are 3 main recommended workflows:

  1. train from scratch: python train.py --weights '' --cfg yolov5s.yaml
  2. train from pretrained: python train.py --weights yolov5s.pt
  3. resume an interrupted training: python train.py --resume optional/path/to/last.pt

That said though, many users are under the false impression that the --resume command can be used to extend your training and keep getting better results on a fully trained model. We've prepared an automatic reply that generally describes these concepts below.

👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck and let us know if you have any other questions!

@AyushExel
Contributor

@glenn-jocher just to confirm, a pre-trained model is a model that has been completely trained right? Which means that a pre-trained model should never have this problem because wandb_id should not be present in it

@totti0223
Author


@AyushExel P.S. Not only the labels but the dataset itself increases.

@glenn-jocher
Member

@AyushExel yes by pretrained model I mean a model that successfully completed training and was processed by the strip_optimizer() function at the end of training. The official models and the custom models are handled exactly the same way, i.e. there's nothing special about the official models in the release assets, they are simply trained from scratch on COCO using the default training commands.

yolov5/utils/general.py

Lines 618 to 632 in f8adee1

def strip_optimizer(f='best.pt', s=''):  # from utils.general import *; strip_optimizer()
    # Strip optimizer from 'f' to finalize training, optionally save as 's'
    x = torch.load(f, map_location=torch.device('cpu'))
    if x.get('ema'):
        x['model'] = x['ema']  # replace model with ema
    for k in 'optimizer', 'training_results', 'wandb_id', 'ema', 'updates':  # keys
        x[k] = None
    x['epoch'] = -1
    x['model'].half()  # to FP16
    for p in x['model'].parameters():
        p.requires_grad = False
    torch.save(x, s or f)
    mb = os.path.getsize(s or f) / 1E6  # filesize
    print(f"Optimizer stripped from {f},{(' saved as %s,' % s) if s else ''} {mb:.1f}MB")

@totti0223 yes you can train any pretrained model on any new dataset, there are no constraints.

@AyushExel
Contributor

@glenn-jocher Got it. So I think the current integration complains when users try transfer learning from partially trained, unstripped models, which is what is intended as a sanity check.
Nevertheless, I have made a PR that removes this check -> #3604

@totti0223
Author

Hi, I'm trying to catch up with the discussion and probably misunderstood partially vs. fully trained. I am not talking about resuming or using partially trained weights.

One of my approaches is starting with the COCO pretrained weights to adapt to a custom DatasetA, then afterwards using those fully trained weights to train another DatasetB in the future, where DatasetA and DatasetB share closer domain similarity to each other than to COCO, so starting from the DatasetA fully trained weights is generally favorable. (DatasetB may be a further enlarged version of DatasetA.)

However, if many users of this repo do not face the case above and starting from the COCO trained weights is almost always the choice, the code revision discussed does not need to be incorporated, as only a little fix is needed to grant such a need. Will re-close this issue 👍

@glenn-jocher
Member

@totti0223 your workflow details seem to be consistent with standard practice to me. 👍

@AyushExel
Contributor

@totti0223 a fully trained weight should not have wandb_id present in it so you should not see this error.
Are you using it like this:

  • First train on DatasetA using python train.py --epochs 5 --data DatasetA.yaml. Let this train completely.
  • Then train on datasetB using python train.py --epochs 5 --data DatasetB.yaml --weights path_to_best.pt_from_datasetA

@totti0223
Author

@AyushExel With your comment I've now finally got the point of partially vs. fully trained.
So the wandb_id is removed from both best.pt and last.pt only if the training finishes.

This is because I always set the training epochs to a large few hundred and stop manually early by watching the model metrics for saturation (in this case, best.pt is fully trained in my context but is partially trained in the yolov5 repo's meaning).

I guess in this case I shall manually apply the strip_optimizer() function to that weight and the problem is solved.

cf.
#294
#1633
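
A minimal sketch of that manual step (checkpoint path assumed from this thread; run from the yolov5 repo root):

# strip a manually stopped checkpoint so it can be reused cleanly with --weights;
# per strip_optimizer() above, this clears optimizer, training_results, wandb_id, ema, updates
from utils.general import strip_optimizer

strip_optimizer('runs/train/my_train/weights/best.pt')  # overwrites the file in place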

@glenn-jocher
Member

@totti0223 yes you can call strip_optimizer() on any yolov5.pt file at any point. Afterward it is not resumable with --resume, but you can start a new training on it with --weights.
