-
Notifications
You must be signed in to change notification settings - Fork 825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat.] NNGraph new eager tensor for new variable created in JobPass #6091
Conversation
CI failed, removing label automerge |
Speed stats:
|
Speed stats:
|
本地测试日志:
|
@@ -236,7 +236,12 @@ void PlanUtil::GenMemBlockAndChunkWithVariableOpNames4Plan( | |||
.op_conf(); | |||
if (!op_conf.has_variable_conf()) { return false; } | |||
const std::string& var_name = op_conf.name(); | |||
if (variable_op_names.find(var_name) == variable_op_names.end()) { return false; } | |||
if (variable_op_names.find(var_name) == variable_op_names.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里是可以发现那些没有对应eager tensor的variable op对吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是的,所以我会弹 warning 出来。 目前理论上是没有的,所有的 Variable op 都应该找到对应的 eager tensor
} else if (var_conf.initializer().has_constant_int_conf()) { | ||
value = var_conf.initializer().constant_int_conf().value(); | ||
} else { | ||
OF_UNIMPLEMENTED(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个分支到不了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
对
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我怕后面还需要支持更多的 initializer (比如不是 constant),这里先把 if 都写全
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Speed stats:
|
重要功能 和 BUG 修复:
NNGraph 支持为 JobPass 里创建的 Variable Conf :
已在本地验证过 AMP 单卡和多卡(2卡)的训练正常
该PR 解决了 AMP 训练 DynamicLossScale 初始值为 0 的问题(原本应为 2^30)。
之前没有为这些 JobPass 中的 Variable 创建 Tensor 并初始化,但恰巧绝大多数这类的 Variable (比如 BN 的 model.bn1.weight-momentum)的 initializer 都是 constant 0 ,所以没有引发实际训练的问题。本 PR 修复此 BUG。
TODO: