
Runtime error when training #12

Open

Man1978-scd opened this issue Apr 10, 2023 · 1 comment

Comments


Man1978-scd commented Apr 10, 2023

When I use torch 1.10.0 and run the training script

bash tools/dist_train.sh work_configs/tamper/tamper_convx_b_exp.py 2

I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [5, 512, 32, 32]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The error is raised inside the mmcv library, in runner/epoch_based_runner.py.
I assumed it was a version problem, so I downgraded torch to 1.5, which produced a different error instead:

RuntimeError: The size of tensor a (2) must match the size of tensor b (128) at non-singleton dimension 3

I don't know how to track this problem down 🤔. Could the author please provide a requirements file with the matching versions, and a usage/tutorial document? 🙀
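The hint in the first traceback points at autograd anomaly detection, which reports the forward op that produced the tensor later modified in place. A minimal sketch of using it, assuming a plain PyTorch training loop (the model, optimizer, and data below are illustrative stand-ins, not code from this repo or mmcv):

import torch

# Enable anomaly detection before training; autograd then records the
# forward stack trace for every op, so a version-mismatch error during
# backward() names the forward operation that caused it.
torch.autograd.set_detect_anomaly(True)

# Illustrative stand-ins for the real model/optimizer/data (assumption).
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(4, 8)

optimizer.zero_grad()
loss = model(x).sum()
loss.backward()  # on failure, the report now includes the offending forward op
optimizer.step()

Anomaly detection slows training noticeably, so it is meant for debugging runs only.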

Man1978-scd (Author) commented

Solved 🤮

While debugging, I found that when the model runs with mixed-precision optimization, it goes through the weight-copying function in mmcv.runner.hooks.optimizer.py:

def copy_grads_to_fp32(self, fp16_net, fp32_weights):
    """Copy gradients from fp16 model to fp32 weight copy."""
    for fp32_param, fp16_param in zip(fp32_weights,
                                      fp16_net.parameters()):
        if fp16_param.grad is not None:
            if fp32_param.grad is None:
                # Lazily allocate the fp32 grad buffer on first use.
                fp32_param.grad = fp32_param.data.new(
                    fp32_param.size())
            fp32_param.grad.copy_(fp16_param.grad)

The grad shapes of fp32_param and fp16_param did not match, which made the copy fail: with torch 1.5, fp16_net.parameters() returned the weight tensor at one point and the bias tensor at another, so zip paired mismatched parameters. I was speechless. Upgrading to torch 1.6 made it run normally 🙀
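A hypothetical minimal repro of that failure mode (my own sketch, not code from mmcv): if the two parameter sequences arrive in different orders, zip pairs a weight with a bias, and copy_ raises the same kind of size-mismatch RuntimeError quoted above.

import torch

# Hypothetical fp32 master copies in declaration order: weight, then bias.
fp32_weights = [torch.zeros(4, 3), torch.zeros(4)]

# fp16 parameters coming back in the opposite order (the torch 1.5
# behavior described above): bias first, then weight.
fp16_params = [torch.zeros(4, dtype=torch.half),
               torch.zeros(4, 3, dtype=torch.half)]
for p in fp16_params:
    p.grad = torch.ones_like(p)

for fp32_param, fp16_param in zip(fp32_weights, fp16_params):
    if fp32_param.grad is None:
        fp32_param.grad = fp32_param.data.new(fp32_param.size())
    # Mismatched pairing: copying a (4,) grad into a (4, 3) buffer raises
    # "RuntimeError: The size of tensor a ... must match the size of tensor b ..."
    fp32_param.grad.copy_(fp16_param.grad)

A defensive assert fp32_param.shape == fp16_param.shape at the top of the loop would surface the mis-ordering with the offending shapes instead of failing deep inside copy_.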
