[Feature] Publish models after training if published_keys is set in CheckpointHook #987

Merged
merged 57 commits into from
Mar 29, 2023

KerwinKai
Contributor

Motivation

As described in #905.

Modification

[Enhancement] Automatically publish models in checkpoint_hook.py and update the corresponding documentation in hook.md

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

For example:

I trained ResNet-50 on CIFAR-10 using the code from the MMEngine example docs, and passed the following default_hooks to the runner:

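from mmengine.runner import Runner
from torch.optim import SGD

# MMResNet50, Accuracy, train_dataloader and val_dataloader are defined
# as in the MMEngine example docs.
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        save_best='accuracy',
        # NOTE: 'less' treats a lower accuracy as better;
        # 'greater' is the usual rule for accuracy.
        rule='less',
        # Top-level keys to keep in the published checkpoint.
        published_keys=['meta', 'state_dict']))
runner = Runner(
    model=MMResNet50(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    optim_wrapper=dict(optimizer=dict(type=SGD, lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_cfg=dict(),
    val_evaluator=dict(type=Accuracy),
    default_hooks=default_hooks,
)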

After training, the model is published automatically. Here are the logs:

Loads checkpoint by local backend from path: /data/run01/scz0b6e/mm-lab/run_place/work_dir/epoch_5.pth
03/08 15:55:18 - mmengine - INFO - Load checkpoint from /data/run01/scz0b6e/mm-lab/run_place/work_dir/epoch_5.pth
03/08 15:55:18 - mmengine - INFO - Key `message_hub` will be removed because it is not in save_keys. If you want to keep it, please set `message_hub` in published_keys
03/08 15:55:18 - mmengine - INFO - Key `optimizer` will be removed because it is not in save_keys. If you want to keep it, please set `optimizer` in published_keys
03/08 15:55:18 - mmengine - INFO - The published model is saved at /data/run01/scz0b6e/mm-lab/run_place/work_dir/epoch_5-d16194b7.pth.
Loads checkpoint by local backend from path: /data/run01/scz0b6e/mm-lab/run_place/work_dir/best_accuracy_epoch_1.pth
03/08 15:55:18 - mmengine - INFO - Load checkpoint from /data/run01/scz0b6e/mm-lab/run_place/work_dir/best_accuracy_epoch_1.pth
03/08 15:55:18 - mmengine - INFO - Key `message_hub` will be removed because it is not in save_keys. If you want to keep it, please set `message_hub` in published_keys
03/08 15:55:19 - mmengine - INFO - The published model is saved at /data/run01/scz0b6e/mm-lab/run_place/work_dir/best_accuracy_epoch_1-107ccaf7.pth.
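
The log lines above suggest what publishing does: load the checkpoint, drop the top-level keys that are not listed in published_keys, and save the result under a name with a short hash suffix. A minimal sketch of that idea follows. It is not the code added in this PR: `publish_checkpoint` is a hypothetical helper, `pickle` stands in for `torch.load`/`torch.save` to keep the example dependency-free, and taking the first 8 hex characters of a SHA-256 digest is an assumption based on the 8-character suffixes in the file names.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def publish_checkpoint(src, published_keys):
    """Hypothetical helper: keep only `published_keys`, add a hash suffix."""
    src = Path(src)
    with src.open('rb') as f:
        checkpoint = pickle.load(f)  # stand-in for torch.load

    # Drop every top-level key that was not requested.
    removed = [k for k in checkpoint if k not in published_keys]
    slim = {k: v for k, v in checkpoint.items() if k in published_keys}

    payload = pickle.dumps(slim)  # stand-in for torch.save
    # Assumption: the suffix is the first 8 hex characters of a SHA-256 digest.
    digest = hashlib.sha256(payload).hexdigest()[:8]
    dst = src.with_name(f'{src.stem}-{digest}{src.suffix}')
    dst.write_bytes(payload)
    return dst, removed


# Toy checkpoint with the same top-level keys as epoch_5.pth.
work_dir = Path(tempfile.mkdtemp())
ckpt_path = work_dir / 'epoch_5.pth'
ckpt_path.write_bytes(pickle.dumps({
    'meta': {'epoch': 5},
    'state_dict': {'weight': [0.1, 0.2]},
    'message_hub': {},
    'optimizer': {},
}))

published_path, removed_keys = publish_checkpoint(
    ckpt_path, published_keys=['meta', 'state_dict'])
published = pickle.loads(published_path.read_bytes())
print(published_path.name, sorted(published), removed_keys)
```

The original file is left in place and a second, smaller file is written next to it, which matches the folder listing in this PR where both epoch_5.pth and its published counterpart exist.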

Here is the resulting folder structure:

(base) [scz0b6e@ln01 run_place]$ cd work_dir/
(base) [scz0b6e@ln01 work_dir]$ ll
total 1302396
drwxrwxr-x 3 scz0b6e scz0b6e      4096 Mar  8 15:50 20230308_155014
-rw-rw-r-- 1 scz0b6e scz0b6e         1 Mar  8 15:50 20230308_155014.py
-rw-rw-r-- 1 scz0b6e scz0b6e 102497337 Mar  8 15:55 best_accuracy_epoch_1-107ccaf7.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 102670911 Mar  8 15:51 best_accuracy_epoch_1.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 204935167 Mar  8 15:51 epoch_1.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 205061629 Mar  8 15:52 epoch_2.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 205186557 Mar  8 15:53 epoch_3.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 205311485 Mar  8 15:54 epoch_4.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 102497337 Mar  8 15:55 epoch_5-d16194b7.pth
-rw-rw-r-- 1 scz0b6e scz0b6e 205436221 Mar  8 15:55 epoch_5.pth
-rw-rw-r-- 1 scz0b6e scz0b6e        57 Mar  8 15:55 last_checkpoint
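
The file sizes in the listing show how much space publishing saves. The short calculation below uses only the byte counts from the listing; the variable names are illustrative. Presumably the large saving for epoch_5.pth comes from dropping the optimizer state, while best_accuracy_epoch_1.pth only loses the small message_hub entry.

```python
# (original size, published size) in bytes, copied from the listing above.
sizes = {
    'best_accuracy_epoch_1.pth': (102670911, 102497337),
    'epoch_5.pth': (205436221, 102497337),
}

savings = {}
for name, (before, after) in sizes.items():
    # Fraction of the original file that publishing removed.
    savings[name] = (before - after) / before
    print(f'{name}: {before} -> {after} bytes '
          f'({savings[name]:.2%} smaller)')
```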

Use the following code to check and compare the keys:

import torch


def process_checkpoint(in_file):
    """Print the path and the top-level keys of a checkpoint file."""
    print(in_file)
    checkpoint = torch.load(in_file, map_location='cpu')
    ckpt_keys = list(checkpoint.keys())
    print(ckpt_keys)
    print()


def main():
    # Compare each original checkpoint with its published counterpart.
    process_checkpoint('/HOME/scz0b6e/run/mm-lab/run_place/work_dir/best_accuracy_epoch_1.pth')
    process_checkpoint('/HOME/scz0b6e/run/mm-lab/run_place/work_dir/best_accuracy_epoch_1-107ccaf7.pth')
    process_checkpoint('/HOME/scz0b6e/run/mm-lab/run_place/work_dir/epoch_5.pth')
    process_checkpoint('/HOME/scz0b6e/run/mm-lab/run_place/work_dir/epoch_5-d16194b7.pth')


if __name__ == '__main__':
    main()

And here is the output:

/HOME/scz0b6e/run/mm-lab/run_place/work_dir/best_accuracy_epoch_1.pth
['meta', 'state_dict', 'message_hub']

/HOME/scz0b6e/run/mm-lab/run_place/work_dir/best_accuracy_epoch_1-107ccaf7.pth
['meta', 'state_dict']

/HOME/scz0b6e/run/mm-lab/run_place/work_dir/epoch_5.pth
['meta', 'state_dict', 'message_hub', 'optimizer']

/HOME/scz0b6e/run/mm-lab/run_place/work_dir/epoch_5-d16194b7.pth
['meta', 'state_dict']
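
To make the comparison explicit, the key lists printed above can be diffed directly. This is a small illustrative sketch that uses only the key names from that output; the dictionary layout and variable names are made up for the example.

```python
# Top-level keys copied from the output above.
keys = {
    'best_accuracy_epoch_1': {
        'original': ['meta', 'state_dict', 'message_hub'],
        'published': ['meta', 'state_dict'],
    },
    'epoch_5': {
        'original': ['meta', 'state_dict', 'message_hub', 'optimizer'],
        'published': ['meta', 'state_dict'],
    },
}

removed = {}
for name, pair in keys.items():
    # Keys present in the original checkpoint but absent after publishing.
    removed[name] = [k for k in pair['original'] if k not in pair['published']]
    print(f'{name}: removed {removed[name]}')
```

The removed keys agree with the "will be removed" messages in the training log: message_hub for the best checkpoint, and message_hub plus optimizer for epoch_5.pth.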

The full logs: https://github.com/KerwinKai/mmengine/blob/main/20230308_155014.log

@CLAassistant

CLAassistant commented Mar 8, 2023

CLA assistant check
All committers have signed the CLA.

To avoid `mypy` warning `mmengine/hooks/checkpoint_hook.py:358: error: Unsupported right operand type for in ("Optional[List[str]]") Found 1 error in 1 file (checked 224 source files)`
Try to avoid the trim trailing whitespace warning in hook.md
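
The mypy error quoted above appears when `in` is applied to a value typed `Optional[List[str]]`, because `None` does not support membership tests. One common way to satisfy the checker is to narrow the type with an explicit `None` check first. The sketch below illustrates that pattern; `should_keep` is a hypothetical function, and this is not necessarily how the PR resolved the warning.

```python
from typing import List, Optional


def should_keep(key: str, published_keys: Optional[List[str]]) -> bool:
    """Hypothetical check: is `key` kept in the published checkpoint?"""
    # Narrow Optional[List[str]] to List[str] before using `in`,
    # so mypy no longer sees a possible None on the right-hand side.
    if published_keys is None:
        return False
    return key in published_keys


print(should_keep('meta', ['meta', 'state_dict']))
print(should_keep('optimizer', ['meta', 'state_dict']))
print(should_keep('meta', None))
```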
KerwinKai and others added 10 commits March 8, 2023 19:29
Co-authored-by: Mashiro <57566630+HAOCHENYE@users.noreply.github.com>
KerwinKai and others added 2 commits March 8, 2023 19:37
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
@zhouzaida
Collaborator

Hi, it would be better to also update docs/zh_cn/tutorials/hook.md.

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
@codecov

codecov bot commented Mar 19, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@f356b3c).
Patch has no changes to coverable lines.

❗ Current head 533ac42 differs from pull request most recent head 28499ad. Consider uploading reports for the commit 28499ad to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #987   +/-   ##
=======================================
  Coverage        ?   76.59%           
=======================================
  Files           ?      139           
  Lines           ?    11068           
  Branches        ?     2219           
=======================================
  Hits            ?     8478           
  Misses          ?     2219           
  Partials        ?      371           
Flag Coverage Δ
unittests 76.59% <0.00%> (?)

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
KerwinKai and others added 5 commits March 24, 2023 16:43
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
zhouzaida previously approved these changes Mar 28, 2023
Co-authored-by: Mashiro <57566630+HAOCHENYE@users.noreply.github.com>
@zhouzaida zhouzaida changed the title [Enhancement] Automatically Publish [Feature] Publish models after training Mar 28, 2023
@zhouzaida zhouzaida changed the title [Feature] Publish models after training [Feature] Publish models after training if published_keys is set in CheckpointHook Mar 29, 2023
@zhouzaida zhouzaida merged commit 5b35c5b into open-mmlab:main Mar 29, 2023

4 participants