add: test convert dependency #6023

Merged
merged 48 commits into from
Sep 7, 2021

@hhhfccz hhhfccz commented Aug 24, 2021

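TVM and oneflow_convert_tool need to obtain the shape and dtype of every node through the graph.
Current dependencies on repr(graph):

  1. types = ["INPUT", "PARAMETER", "BUFFER", "OUTPUT"]

Here, input and output should be the I/O of the computation graph, with names like _OneFlowGraph0-input_0.
A buffer is something like running_mean/var in the batchnorm operator.

There is also a dependency on the return value of flow.load:

  1. It must be possible to extract the path corresponding to each layer's parameters; the TVM conversion relies on this path information to pair up nodes.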
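
To make the dependency on repr(graph) concrete, here is a minimal sketch of how a converter might pull the kind, name, shape, and dtype out of one repr entry. The entry format is taken from the repr excerpt quoted later in this thread; the regular expression and the helper name `parse_entries` are illustrative assumptions, not the code used by TVM or oneflow_convert_tool.

```python
import re

# Entry kinds the converters look for in repr(graph), as listed above.
TYPES = ["INPUT", "PARAMETER", "BUFFER", "OUTPUT"]

# Assumed entry layout: "(KIND:name:tensor(..., size=(...), dtype=...)"
ENTRY = re.compile(
    r"\((?P<kind>" + "|".join(TYPES) + r"):(?P<name>[^:]+):tensor\(.*?"
    r"size=\((?P<size>[^)]*)\),\s*dtype=(?P<dtype>[\w.]+)",
    re.S,
)


def parse_entries(graph_repr):
    """Return {name: (kind, shape, dtype)} for each tensor entry in the text."""
    nodes = {}
    for m in ENTRY.finditer(graph_repr):
        # "64," and "1, 4" both become tuples of ints; empty pieces are skipped.
        shape = tuple(int(d) for d in m["size"].split(",") if d.strip())
        nodes[m["name"]] = (m["kind"], shape, m["dtype"])
    return nodes


sample = "(BUFFER:m.dummy_buff:tensor(..., size=(1, 4), dtype=oneflow.float32)): ()"
print(parse_entries(sample))
```

Because this relies on the textual layout of repr(graph), any change to that layout breaks the parser, which is the reason for pinning the format down with a test.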

CLAassistant commented Aug 24, 2021

CLA assistant check
All committers have signed the CLA.

@flow.unittest.skip_unless_1n1d()
class TestConvertDependency(flow.unittest.TestCase):
    def test_get_params(test_case):
        model_dir_path = "alexnet_oneflow_model"
Contributor

Is this path available by default in CI?

Contributor Author

This requires downloading the pretrained model parameters.

@BBuf BBuf requested a review from strint August 24, 2021 09:33

p_size = re.compile(r"size=\(.*?\)", re.S)
p_type = re.compile(r"dtype=.*?,", re.S)
types = ["INPUT", "PARAMETER", "BUFFER", "OUTPUT"]
Contributor

@strint strint Aug 24, 2021

The inputs and outputs of nn.Graph can be any of these types: Tensor, None, TensorTuple, List[Tensor].
Does this only consider Tensor?

That said, repr does indeed flatten TensorTuple and List[Tensor] into individual Tensors; see PR #5803.

Contributor

Also, regarding "obtain the shape and dtype of every node through the graph": repr only has graph-level and module-level entries, not op-level ones. Is that not a problem either?

Contributor Author

The main goal here is to get the information for every node. Previously this was available through job.helper, but helper now appears to be None. The None case for input is not handled here: when converting to TVM, a missing input makes the conversion fail with an error right away. Op-level node information is extracted during conversion from what is parsed out of repr(graph), so it should not be affected.
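
As a rough illustration of the flattening discussed in this review thread, the sketch below expands nested lists and tuples into individually named entries and skips None. The helper `flatten_io` is hypothetical, plain strings stand in for tensors, and the way indices are assigned around None is an assumption; only the `_OneFlowGraph0-input_0` naming pattern comes from the PR description.

```python
def flatten_io(values, graph_name="_OneFlowGraph0", kind="input"):
    """Flatten nested inputs/outputs into {name: value}, skipping None."""
    flat = {}

    def visit(value):
        if value is None:
            return  # a missing input contributes no entry
        if isinstance(value, (list, tuple)):
            for item in value:
                visit(item)
        else:
            # Name entries in visiting order: <graph>-<kind>_<index>.
            flat[f"{graph_name}-{kind}_{len(flat)}"] = value

    visit(values)
    return flat


print(flatten_io(["a", None, ("b", "c")]))
```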

)
)
if not graph._is_compiled:
    _ = graph._compile(flow.rand(shape_input))
Contributor

If we later turn _compile into a public interface, for example by renaming it to compile, how should this be handled? Prompt the user to match the OneFlow version?

Contributor

@strint strint Aug 24, 2021

For 0.5.0 and earlier, we treat this test as the interface contract.

Contributor Author

If we later turn _compile into a public interface, for example by renaming it to compile, how should this be handled? Prompt the user to match the OneFlow version?

Will the change from _compile to compile be done within this week? If it is coming soon, this part and the part after it (getting all nodes) can be left out for now, and I can open another PR to add them back once graph development is complete.

Contributor

It will not change in the short term. We will only consider making this a public interface once the various graph training features are stable. You can rely on the current one.
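
One way a converter could guard against the possible rename is to look the method up by name and fall back to the private one. This is only a sketch: a public `compile` method is hypothetical (per this thread, only `_compile` exists for now), and `FakeGraph` is a stand-in so the example runs without OneFlow installed.

```python
def compile_graph(graph, *example_inputs):
    """Compile `graph` once, preferring a public `compile` if one ever exists."""
    if getattr(graph, "_is_compiled", False):
        return
    # `compile` is hypothetical; fall back to the private `_compile`.
    compile_fn = getattr(graph, "compile", None) or getattr(graph, "_compile")
    compile_fn(*example_inputs)


class FakeGraph:
    """Stand-in for nn.Graph; it only records how often it was compiled."""

    def __init__(self):
        self._is_compiled = False
        self.calls = 0

    def _compile(self, *args):
        self.calls += 1
        self._is_compiled = True


g = FakeGraph()
compile_graph(g, "dummy input")
compile_graph(g, "dummy input")  # second call is a no-op
```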

Contributor Author

OK.

if size_attr[-2] == ",":
    size_attr = size_attr.replace(",", "")
if type_attr[-1] == ",":
    type_str = type_attr.replace(",", "")
Contributor

This check is a bit weak: it only guarantees that some content is present. It would be better to check that the content is correct.
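
A stricter check could parse the size into a tuple of integers and compare both shape and dtype against expected values. The sketch below assumes attribute text of the form `size=(...), dtype=...`; the helper names are hypothetical, and the (1, 4) float32 example is borrowed from the dummy buffer mentioned elsewhere in this thread rather than from the AlexNet model under test.

```python
import re


def extract_shape_and_dtype(attr_text):
    """Pull (shape, dtype) out of text like 'size=(1, 4), dtype=oneflow.float32'."""
    size = re.search(r"size=\(([^)]*)\)", attr_text)
    dtype = re.search(r"dtype=([\w.]+)", attr_text)
    if size is None or dtype is None:
        raise ValueError(f"missing size or dtype in {attr_text!r}")
    shape = tuple(int(d) for d in size.group(1).split(",") if d.strip())
    return shape, dtype.group(1)


def check_node(attr_text, expected_shape, expected_dtype="oneflow.float32"):
    """Fail loudly when the parsed content differs from what is expected."""
    shape, dtype = extract_shape_and_dtype(attr_text)
    assert shape == expected_shape, f"shape {shape} != {expected_shape}"
    assert dtype == expected_dtype, f"dtype {dtype} != {expected_dtype}"


check_node("tensor(..., size=(1, 4), dtype=oneflow.float32)", (1, 4))
```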

@@ -0,0 +1,105 @@
"""
Contributor

@strint strint Aug 24, 2021

test_xx_convert_dependency.py

It would be better to make xx explicit.

Or call it test_api_dependency_on_graph.py

Contributor Author

OK. This script is used by both the TVM conversion and the ONNX conversion, so I did not distinguish between them at first.

hhhfccz commented Aug 25, 2021

@strint

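  1. Renamed the test script.
  2. Added a float32 check on the extracted dtype (I have not yet come up with a good way to check the shape contents).
  3. Constrained the number of AlexNet params; 16 params are extracted here.
  4. One possible issue: during TVM conversion, converting the batchnorm operator uses BUFFER, and the model currently used for testing is AlexNet, which does not involve it. How will nodes previously marked as BUFFER be handled in Graph going forward? Will they be marked as PARAMETER?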

strint commented Aug 25, 2021

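@strint

  1. Renamed the test script.

OK.

  2. Added a float32 check on the extracted dtype (I have not yet come up with a good way to check the shape contents).

Could you pick one tensor and hard-code a check for it?

  3. Constrained the number of AlexNet params; 16 params are extracted here.

OK.

  4. One possible issue: during TVM conversion, converting the batchnorm operator uses BUFFER, and the model currently used for testing is AlexNet, which does not involve it. How will nodes previously marked as BUFFER be handled in Graph going forward? Will they be marked as PARAMETER?

You can construct a module yourself and register a buffer in it:

self.register_buffer("dummy_buff", flow.Tensor(1, 4))  # e.g. register a buffer yourself; this both verifies the buffer and lets you check the tensor shape

See: oneflow/python/oneflow/test/graph/test_graph.py
In repr it looks like this:

(BUFFER:m.dummy_buff:tensor(..., size=(1, 4), dtype=oneflow.float32)): ()

The bn in resnet50: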

    (MODULE:resnet50.bn1:BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)): (                                                                                                
      (PARAMETER:resnet50.bn1.weight:tensor(...,                                                                                                                                                            
             placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0, 1]}, hierarchy=(2,)),                                                                                              
             sbp=(oneflow.sbp.broadcast,), size=(64,), dtype=oneflow.float32,                                                                                                                               
             requires_grad=True)): ()                                                                                                                                                                       
      (PARAMETER:resnet50.bn1.bias:tensor(...,                                                                                                                                                              
             placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0, 1]}, hierarchy=(2,)),                                                                                              
             sbp=(oneflow.sbp.broadcast,), size=(64,), dtype=oneflow.float32,                                                                                                                               
             requires_grad=True)): ()                                                                                                                                                                       
      (BUFFER:resnet50.bn1.running_mean:tensor(...,                                                                                                                                                         
             placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0, 1]}, hierarchy=(2,)),                                                                                              
             sbp=(oneflow.sbp.broadcast,), size=(64,), dtype=oneflow.float32)): ()                                                                                                                          
      (BUFFER:resnet50.bn1.running_var:tensor(...,                                                                                                                                                          
             placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0, 1]}, hierarchy=(2,)),                                                                                              
             sbp=(oneflow.sbp.broadcast,), size=(64,), dtype=oneflow.float32)): ()                                                                                                                          
    )                                                               
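
As a hedged sketch of how a test might confirm that the batchnorm running statistics show up as BUFFER rather than PARAMETER, the snippet below groups entry names by kind. The text is an abbreviated copy of the resnet50 excerpt above; the pattern and the helper `names_by_kind` are illustrative assumptions.

```python
import re
from collections import defaultdict

# Abbreviated from the resnet50 bn1 repr excerpt quoted above.
BN_REPR = '''
(MODULE:resnet50.bn1:BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)): (
  (PARAMETER:resnet50.bn1.weight:tensor(...,
         placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0, 1]}, hierarchy=(2,)),
         sbp=(oneflow.sbp.broadcast,), size=(64,), dtype=oneflow.float32,
         requires_grad=True)): ()
  (BUFFER:resnet50.bn1.running_mean:tensor(...,
         placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0, 1]}, hierarchy=(2,)),
         sbp=(oneflow.sbp.broadcast,), size=(64,), dtype=oneflow.float32)): ()
)
'''

# Matches only the "(KIND:name:tensor(" header of each tensor entry.
HEADER = re.compile(r"\((PARAMETER|BUFFER):([\w.]+):tensor\(")


def names_by_kind(text):
    """Group tensor entry names by kind, e.g. {'BUFFER': [...], 'PARAMETER': [...]}."""
    groups = defaultdict(list)
    for kind, name in HEADER.findall(text):
        groups[kind].append(name)
    return dict(groups)


print(names_by_kind(BN_REPR))
```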

hhhfccz commented Aug 26, 2021

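Thanks for the suggestions, @strint.

  1. Added a check for the buffer.
  2. Added checks on the first-layer conv2d.weights of AlexNet and on the last-layer linear.weights.
  3. After getting the nodes, added a check that extracts each node's attributes.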

@strint strint left a comment

lgtm

BBuf commented Aug 26, 2021

The name could be changed: test_api_dependency_on_graph.py -> test_tvm_fronted_api_dependency_on_graph.py

hhhfccz commented Aug 26, 2021

The name could be changed: test_api_dependency_on_graph.py -> test_tvm_fronted_api_dependency_on_graph.py

OK, done.

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 26, 2021 11:52
@BBuf BBuf added the eager label Aug 26, 2021
@github-actions

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 29, 2021 17:15
@github-actions

CI failed, removing label automerge

@github-actions

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot self-requested a review September 6, 2021 13:52
@oneflow-ci-bot oneflow-ci-bot removed their request for review September 6, 2021 14:17
@oneflow-ci-bot oneflow-ci-bot self-requested a review September 6, 2021 14:17
github-actions bot commented Sep 6, 2021

CI failed, removing label automerge

@github-actions github-actions bot removed the automerge label Sep 6, 2021
@oneflow-ci-bot oneflow-ci-bot removed their request for review September 6, 2021 15:18
@oneflow-ci-bot oneflow-ci-bot self-requested a review September 6, 2021 16:14
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 7, 2021 01:49
github-actions bot commented Sep 7, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.5ms (= 6423.1ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.4ms (= 7070.6ms / 50, input_shape=[16, 3, 224, 224])
Relative speed: 1.10 (= 141.4ms / 128.5ms)

OneFlow resnet50 time: 74.7ms (= 3734.7ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.3ms (= 4164.5ms / 50, input_shape=[8, 3, 224, 224])
Relative speed: 1.12 (= 83.3ms / 74.7ms)

OneFlow resnet50 time: 47.5ms (= 2374.0ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.9ms (= 3046.1ms / 50, input_shape=[4, 3, 224, 224])
Relative speed: 1.28 (= 60.9ms / 47.5ms)

OneFlow resnet50 time: 39.1ms (= 1955.9ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 50.0ms (= 2501.4ms / 50, input_shape=[2, 3, 224, 224])
Relative speed: 1.28 (= 50.0ms / 39.1ms)

OneFlow resnet50 time: 34.4ms (= 1720.7ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 44.7ms (= 2232.5ms / 50, input_shape=[1, 3, 224, 224])
Relative speed: 1.30 (= 44.7ms / 34.4ms)

OneFlow resnet50 time: 152.6ms (= 7628.3ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.7ms (= 8134.4ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
Relative speed: 1.07 (= 162.7ms / 152.6ms)

OneFlow resnet50 time: 100.9ms (= 5047.5ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.8ms (= 5192.2ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
Relative speed: 1.03 (= 103.8ms / 100.9ms)

OneFlow resnet50 time: 78.0ms (= 3899.2ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.4ms (= 4069.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
Relative speed: 1.04 (= 81.4ms / 78.0ms)

OneFlow resnet50 time: 68.7ms (= 3436.5ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.0ms (= 3601.1ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
Relative speed: 1.05 (= 72.0ms / 68.7ms)

OneFlow resnet50 time: 68.0ms (= 3399.7ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 61.5ms (= 3076.3ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
Relative speed: 0.90 (= 61.5ms / 68.0ms)
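
The arithmetic behind each block above is a per-iteration average followed by a ratio, where a value above 1 means OneFlow was faster. A minimal sketch, with a hypothetical helper name and the iteration count of 50 read off the report:

```python
def relative_speed(oneflow_total_ms, pytorch_total_ms, iterations=50):
    """Return (OneFlow ms/iter, PyTorch ms/iter, PyTorch time / OneFlow time)."""
    oneflow_ms = oneflow_total_ms / iterations
    pytorch_ms = pytorch_total_ms / iterations
    return oneflow_ms, pytorch_ms, pytorch_ms / oneflow_ms


# Totals from the first block (input_shape=[16, 3, 224, 224]).
of_ms, pt_ms, ratio = relative_speed(6423.1, 7070.6)
print(f"{of_ms:.1f}ms {pt_ms:.1f}ms {ratio:.2f}")
```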

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 7, 2021 03:02
@oneflow-ci-bot oneflow-ci-bot merged commit 4b3dc88 into master Sep 7, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the convert_dependency branch September 7, 2021 03:02