AutoMatedTest support test module.parameter.grad #6043

Merged
merged 24 commits into from
Aug 26, 2021

Conversation

wyg1997
Contributor

@wyg1997 wyg1997 commented Aug 25, 2021

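  • Enhance AutoMatedTest so that it also compares the values of module.parameter.grad
  • Fix the misaligned grad computation in the bn layer
  • Fix a bug where randperm creates a 0-shape tensor, and fix the corresponding unit test
  • Fix the misaligned BatchNorm CPU computation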

TODO:

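  • Add a CPU kernel for bn instead of composing the computation in Python
  • Have the automated test report which parameter of which module is misaligned
  • Move num_batches_tracked into bn's UserOp
  • Normalization needs to support weight and bias being None, which requires changing the functor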

Comment on lines +417 to 423
dual_objects_to_test.append(
    GetDualObject(
        "unused",
        getattr(x.pytorch, key).grad,
        getattr(x.oneflow, key).grad,
    )
)
Contributor Author

Added a comparison of parameter gradients.

Contributor Author

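@leaves-zwx I added a comparison of module parameter gradients here. Only with this check could we catch the incorrect parameter update that occurs when momentum is misaligned.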

@wyg1997
Contributor Author

wyg1997 commented Aug 25, 2021

The CPU implementation of bn is not aligned yet. torch computes it with this formula:

  /// Collect the linear and constant terms regarding the input.
  /// output(n, c, h, w)
  ///     = (input(n, c, h, w) - mean(c)) / sqrt(var(c) + eps) * weight(c)
  ///         + bias(c)
  ///     = input(n, c, h, w) * inv_var(c) * weight(c)
  ///         - mean(c) * inv_var(c) * weight(c) + bias(c),
  /// where inv_var(c) = 1 / sqrt(var(c) + eps).
  /// So the linear term, alpha(c) = inv_var(c) * weight(c),
  ///   the constant term beta(c) = bias(c) - mean(c) * inv_var(c) * weight(c)
  /// Note that this is only a good idea if (input_size >> c), in degenerate
  /// cases where image_size == 1 && batch_size == 1, it is slow.

@hjchen2
Contributor

hjchen2 commented Aug 25, 2021

The CPU implementation of bn is not aligned yet. torch computes it with this formula:

  /// Collect the linear and constant terms regarding the input.
  /// output(n, c, h, w)
  ///     = (input(n, c, h, w) - mean(c)) / sqrt(var(c) + eps) * weight(c)
  ///         + bias(c)
  ///     = input(n, c, h, w) * inv_var(c) * weight(c)
  ///         - mean(c) * inv_var(c) * weight(c) + bias(c),
  /// where inv_var(c) = 1 / sqrt(var(c) + eps).
  /// So the linear term, alpha(c) = inv_var(c) * weight(c),
  ///   the constant term beta(c) = bias(c) - mean(c) * inv_var(c) * weight(c)
  /// Note that this is only a good idea if (input_size >> c), in degenerate
  /// cases where image_size == 1 && batch_size == 1, it is slow.

Our CPU implementation should already match this formula, right? The second equality could introduce precision differences.
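To make the remark about the two forms concrete, here is a minimal pure-Python sketch for a single channel. The names `bn_direct` and `bn_linear` and the sample values are made up for illustration; this is not code from torch or OneFlow. The two forms are algebraically equal, so any difference between them is floating-point rounding only.

```python
import math


def bn_direct(x, mean, var, weight, bias, eps=1e-5):
    """Direct form: (x - mean) / sqrt(var + eps) * weight + bias."""
    return (x - mean) / math.sqrt(var + eps) * weight + bias


def bn_linear(x, mean, var, weight, bias, eps=1e-5):
    """Linear form: x * alpha + beta, as in the quoted comment."""
    inv_var = 1.0 / math.sqrt(var + eps)
    alpha = inv_var * weight                 # linear term
    beta = bias - mean * inv_var * weight    # constant term
    return x * alpha + beta


# Hypothetical values for one channel.
xs = [0.5, -1.25, 3.0, 2.75]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased variance
weight, bias = 0.7, -0.1

direct = [bn_direct(x, mean, var, weight, bias) for x in xs]
linear = [bn_linear(x, mean, var, weight, bias) for x in xs]

# Largest absolute gap between the two forms; expected to be at rounding level.
max_abs_diff = max(abs(a - b) for a, b in zip(direct, linear))
print(max_abs_diff)
```

A gap of this size is harmless for a tolerance-based comparison, which suggests that a visible mismatch has some other cause.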

@wyg1997
Contributor Author

wyg1997 commented Aug 25, 2021

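Our CPU implementation should already match this formula, right? The second equality could introduce precision differences.

I checked again: it was running_mean and running_var that were wrong.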

self.__setattr__("running_mean", running_mean)
self.__setattr__("running_var", running_var)
# use unbiased variance to update running_var
unbiased_variance = x.var(dim=reduce_axis, unbiased=True, keepdim=False)
Contributor Author

running_var is updated with the unbiased estimate, while the subsequent computation uses the true (biased) variance.
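The distinction can be sketched in pure Python for a single channel. The function name `bn_train_step_stats` and the sample batch are hypothetical. The momentum rule `(1 - momentum) * running + momentum * new` is the PyTorch convention and is assumed here to be the target that OneFlow aligns with.

```python
import statistics


def bn_train_step_stats(batch, running_mean, running_var, momentum=0.1):
    """Return the variance used for normalization and the updated running stats."""
    batch_mean = statistics.mean(batch)
    # Biased variance (divide by n): used to normalize the current batch.
    biased_var = statistics.pvariance(batch)
    # Unbiased variance (divide by n - 1): used only to update running_var.
    unbiased_var = statistics.variance(batch)

    new_running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_running_var = (1 - momentum) * running_var + momentum * unbiased_var
    return biased_var, new_running_mean, new_running_var


biased_var, running_mean, running_var = bn_train_step_stats(
    [1.0, 2.0, 3.0, 4.0], running_mean=0.0, running_var=1.0
)
print(biased_var, running_mean, running_var)
```

Using the biased variance in the running_var update would give a smaller value on every step, so the forward output in training mode could still match while the tracked statistics drift apart.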

@hjchen2 hjchen2 self-requested a review August 25, 2021 13:25
@wyg1997 wyg1997 requested a review from oneflow-ci-bot August 25, 2021 13:36
@BBuf
Contributor

BBuf commented Aug 25, 2021

"Have the automated test report which parameter of which module is misaligned": how do you plan to do this?

@wyg1997
Contributor Author

wyg1997 commented Aug 25, 2021

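"Have the automated test report which parameter of which module is misaligned": how do you plan to do this?

Free-standing tensors are hard to handle, but parameters inside a module all have names. We can pass the name in when building the comparison set and print it when a comparison fails.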
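One way this idea could look, as a rough sketch only: each pair in the comparison set carries a name, and a failed comparison reports it. `NamedPair` and `find_mismatches` are hypothetical names, and plain lists of floats stand in for gradients; none of this is the actual AutoMatedTest code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NamedPair:
    """A reference/candidate pair tagged with the parameter name it came from."""
    name: str
    reference: Sequence[float]
    candidate: Sequence[float]


def find_mismatches(pairs: List[NamedPair], rtol=1e-4, atol=1e-5) -> List[str]:
    """Return one message per pair that is not close, including the pair's name."""
    messages = []
    for pair in pairs:
        if len(pair.reference) != len(pair.candidate):
            messages.append(f"{pair.name}: length mismatch")
            continue
        for i, (r, c) in enumerate(zip(pair.reference, pair.candidate)):
            if abs(r - c) > atol + rtol * abs(r):
                messages.append(
                    f"{pair.name}: index {i}, reference={r}, candidate={c}"
                )
                break  # one message per parameter is enough
    return messages


pairs = [
    NamedPair("bn.weight.grad", [0.1, 0.2], [0.1, 0.2]),
    NamedPair("bn.bias.grad", [1.0, 2.0], [1.0, 2.5]),
]
for message in find_mismatches(pairs):
    print(message)
```

In the diff above, the first argument to `GetDualObject` is the literal `"unused"`; passing the parameter key there instead would be one possible place to carry the name, though the PR does not say so.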

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 25, 2021 14:38
@oneflow-ci-bot oneflow-ci-bot self-requested a review August 25, 2021 17:00
@github-actions
Contributor

CI failed, removing label automerge

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 26, 2021 08:37
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 26, 2021 09:15
@oneflow-ci-bot oneflow-ci-bot self-requested a review August 26, 2021 09:15
@MARD1NO
Contributor

MARD1NO commented Aug 26, 2021

Should we add tests for other batchnorm arguments, such as affine?

With affine set to False this currently fails at runtime; the functor's gamma and beta need to be made Optional.

@wyg1997
Contributor Author

wyg1997 commented Aug 26, 2021

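Should we add tests for other batchnorm arguments, such as affine?

With affine set to False this currently fails at runtime; the functor's gamma and beta need to be made Optional.

That needs support in the functor. I'll note it as a TODO and open a separate PR for the change.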
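As a rough illustration of what optional gamma and beta mean for the computation, here is a pure-Python sketch for one channel. `normalize_channel` is a hypothetical helper, not the OneFlow functor, and the eventual functor signature may differ.

```python
import math
from typing import List, Optional


def normalize_channel(
    xs: List[float],
    mean: float,
    var: float,
    weight: Optional[float] = None,  # gamma; None corresponds to affine=False
    bias: Optional[float] = None,    # beta; None corresponds to affine=False
    eps: float = 1e-5,
) -> List[float]:
    """Normalize one channel, applying scale and shift only when provided."""
    inv_std = 1.0 / math.sqrt(var + eps)
    out = [(x - mean) * inv_std for x in xs]
    if weight is not None:
        out = [v * weight for v in out]
    if bias is not None:
        out = [v + bias for v in out]
    return out


# affine=False: no scale or shift is applied.
print(normalize_channel([1.0, 3.0], mean=2.0, var=1.0, eps=0.0))
# affine=True: scale and shift are applied after normalization.
print(normalize_channel([1.0, 3.0], mean=2.0, var=1.0, weight=2.0, bias=0.5, eps=0.0))
```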

@@ -158,6 +164,8 @@ def forward(self, x):
        else:
            if self.training:
                is_training = True
                if self.track_running_stats:
                    self.num_batches_tracked += 1
Contributor

We can't use += here, because it triggers an in-place add, and Consistent SBP inference has a bug in that case.
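The difference between the two spellings can be shown with a toy class. `Counter` below is a hypothetical stand-in for a tensor-like buffer, not the OneFlow Tensor; it only demonstrates that Python dispatches `+=` to `__iadd__` (in place) and `x = x + 1` to `__add__` (a new object).

```python
class Counter:
    """Toy tensor-like object that records in-place additions."""

    def __init__(self, value):
        self.value = value
        self.inplace_calls = 0

    def __add__(self, other):
        # Out-of-place: build and return a new object.
        return Counter(self.value + other)

    def __iadd__(self, other):
        # In-place: mutate this object and return it.
        self.value += other
        self.inplace_calls += 1
        return self


a = Counter(0)
a_alias = a
a += 1       # calls __iadd__, so a is still the same object

b = Counter(0)
b_alias = b
b = b + 1    # calls __add__, so b is rebound to a new object

print(a is a_alias, a.inplace_calls)
print(b is b_alias, b_alias.value)
```

Under that reading, writing `self.num_batches_tracked = self.num_batches_tracked + 1` would presumably avoid the in-place path; this is an inference from the comment, not something the PR shows.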

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 26, 2021 10:27
@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 26, 2021 11:16
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 26, 2021 11:52
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 26, 2021 11:52
@oneflow-ci-bot oneflow-ci-bot self-requested a review August 26, 2021 14:31
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 26, 2021 16:03
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 141.7ms (= 7087.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 128.2ms (= 6412.2ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.11 (= 141.7ms / 128.2ms)

PyTorch resnet50 time: 83.8ms (= 4192.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.6ms (= 3731.3ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.12 (= 83.8ms / 74.6ms)

PyTorch resnet50 time: 62.4ms (= 3118.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.4ms (= 2371.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.32 (= 62.4ms / 47.4ms)

PyTorch resnet50 time: 47.9ms (= 2396.6ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 39.3ms (= 1963.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 47.9ms / 39.3ms)

PyTorch resnet50 time: 43.7ms (= 2182.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 33.4ms (= 1672.5ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.31 (= 43.7ms / 33.4ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 7fccccf into master Aug 26, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the feat-autotest_param_grad branch August 26, 2021 17:04
7 participants