
[feat] add trainer factory #489

Merged

merged 1 commit on Mar 8, 2023

Conversation
@geniuspatrick (Collaborator) commented Mar 7, 2023

Thank you for your contribution to the MindCV repo.
Before submitting this PR, please make sure:

Motivation

We abstracted the fragment of the training script that creates a mindspore.Model (actually a Trainer) into the function create_trainer, a factory method for creating trainers. We believe this abstraction improves code readability.

When creating the trainer, we pass in common components such as the network, optimizer, and loss function as input parameters. We also need to pass in additional parameters to support Auto Mixed Precision (AMP). Below, we elaborate on the factory's design and principles regarding AMP.

  1. The level of AMP:
    We follow the definition from MindSpore.
    We may subsequently consider customized black and white lists.

  2. The type of loss scale:

    • fixed (w/ or w/o drop_overflow_update)
    • dynamic
    • auto

    For fixed or dynamic, we explicitly construct the LossScaleManager and pass it into mindspore.Model. For auto, we do not actively construct a LossScaleManager, but mindspore.Model may use one silently; see mindspore.train.amp for details. A minimal sketch of this mapping is given after this list.

  3. The value of loss scale:
    We raise an error if the value of the loss scale is less than 1. We no longer make a special case for loss_scale=1, although for a while we did. You can now set the loss scale type to auto to achieve the same effect.
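
For illustration, here is a minimal sketch of how such a factory might map loss_scale_type to a LossScaleManager before building mindspore.Model. The function signature and parameter names below are illustrative assumptions, not necessarily the exact interface of create_trainer in MindCV:

```python
import mindspore as ms


def create_trainer(network, loss, optimizer, metrics, amp_level="O0",
                   loss_scale_type="fixed", loss_scale=128.0, drop_overflow_update=False):
    """Illustrative factory: builds a mindspore.Model with the requested AMP/loss-scale setup."""
    if loss_scale < 1.0:
        raise ValueError("Loss scale must be no less than 1.")

    kwargs = dict(network=network, loss_fn=loss, optimizer=optimizer,
                  metrics=metrics, amp_level=amp_level)
    if loss_scale_type.lower() == "fixed":
        kwargs["loss_scale_manager"] = ms.FixedLossScaleManager(
            loss_scale=loss_scale, drop_overflow_update=drop_overflow_update)
    elif loss_scale_type.lower() == "dynamic":
        kwargs["loss_scale_manager"] = ms.DynamicLossScaleManager(
            init_loss_scale=loss_scale, scale_factor=2, scale_window=2000)
    elif loss_scale_type.lower() == "auto":
        # Do not construct a LossScaleManager; mindspore.Model may still apply one
        # silently depending on amp_level (see mindspore.train.amp).
        pass
    else:
        raise ValueError(f"Unsupported loss scale type: {loss_scale_type}")
    return ms.Model(**kwargs)
```

With this shape, the training script only calls the factory and then trains, instead of assembling Model and LossScaleManager inline.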

We have also considered support for a customized TrainStep. The current customized TrainStep supports EMA and gradient clipping. Note: the current customized TrainStep can only be used with a fixed loss scale without dropping overflow!
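
For illustration, a minimal sketch of what such a customized TrainStep could look like, here with gradient clipping only (EMA omitted for brevity); the class and parameter names are assumptions for this example rather than the exact MindCV implementation:

```python
import mindspore.nn as nn
import mindspore.ops as ops


class TrainStep(nn.TrainOneStepCell):
    """Illustrative custom train step: fixed loss scale without overflow detection,
    plus optional global-norm gradient clipping. EMA is omitted for brevity."""

    def __init__(self, network, optimizer, loss_scale=1.0, clip_grad=False, clip_value=15.0):
        # `sens` is the fixed loss scale multiplied into the initial gradient; by
        # MindSpore convention the optimizer should be built with the matching
        # `loss_scale` so that gradients are unscaled again during the update.
        super().__init__(network, optimizer, sens=loss_scale)
        self.clip_grad = clip_grad
        self.clip_value = clip_value

    def construct(self, *inputs):
        loss = self.network(*inputs)
        sens = ops.fill(loss.dtype, loss.shape, self.sens)
        grads = self.grad(self.network, self.weights)(*inputs, sens)
        grads = self.grad_reducer(grads)
        if self.clip_grad:
            grads = ops.clip_by_global_norm(grads, self.clip_value)
        loss = ops.depend(loss, self.optimizer(grads))
        return loss
```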

Test Plan

An ST (system test) already exists in tests/tasks/test_train_val_imagenet_subset.py. Do we need an additional unit test?

Related Issues and PRs

Nope

loss_scale_value=loss_scale, scale_factor=2, scale_window=2000
)
else:
    raise ValueError(f"Loss scale type only supports ['fixed', 'dynamic'], but got {loss_scale_type}.")
@Songyuanwei (Collaborator) commented Mar 8, 2023:

The loss scale type here differs from the one above and does not support auto. Could this be ambiguous for users?

@geniuspatrick (Collaborator, Author) replied:

The customized TrainStep would likewise need to support auto.

    # instead of cell, and TrainStep should be TrainOneStepCell. If drop_overflow_update is True,
    # scale_sense should be FixedLossScaleUpdateCell, and TrainStep should be TrainOneStepWithLossScaleCell.
    train_step_kwargs["scale_sense"] = nn.FixedLossScaleUpdateCell(loss_scale_value=loss_scale)
elif loss_scale_type.lower() == "dynamic":
Collaborator commented:

The current train_step only covers the case without overflow detection. If dynamic loss scale is set, train_step should detect overflow, but the current code does not.

@geniuspatrick (Collaborator, Author) replied:

A restriction on the loss scale was added later.
