
NNGraph interface and implementation for CompileAndRuntime #5558

Merged
merged 6 commits into master from dev_cc_NNGraph_compile_runtime
Jul 22, 2021

Conversation

chengtbf
Contributor

@chengtbf chengtbf commented Jul 21, 2021

Provides the NNGraph interface and implementation (a hedged interface sketch follows below):

  • register_input_op_names
  • register_output_op_names
  • register_variable_op_names_and_tensors
  • compile_and_runtime

TODO:
Support a multi-job version.
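
Based on the method list above, here is a minimal sketch of what the registration side of NNGraph might look like. All member names, types, and the Tensor stand-in are assumptions made for illustration; this is not the actual OneFlow header.

// Hedged sketch only: the real NNGraph lives in OneFlow's C++ core and is
// exposed to Python via bindings; everything below is an assumed shape.
#include <memory>
#include <string>
#include <vector>

class Tensor;  // stand-in for oneflow::one::Tensor

class NNGraph {
 public:
  explicit NNGraph(const std::string& name) : name_(name) {}

  // Record which op names act as graph inputs/outputs/variables so that
  // CompileAndRuntime can wire them up when building the job.
  void RegisterInputOpNames(const std::vector<std::string>& input_op_names) {
    input_op_names_ = input_op_names;
  }
  void RegisterOutputOpNames(const std::vector<std::string>& output_op_names) {
    output_op_names_ = output_op_names;
  }
  void RegisterVariableOpNamesAndTensors(
      const std::vector<std::string>& variable_op_names,
      const std::vector<std::shared_ptr<Tensor>>& variable_tensors) {
    variable_op_names_ = variable_op_names;
    variable_tensors_ = variable_tensors;
  }

 private:
  std::string name_;
  std::vector<std::string> input_op_names_;
  std::vector<std::string> output_op_names_;
  std::vector<std::string> variable_op_names_;
  std::vector<std::shared_ptr<Tensor>> variable_tensors_;
};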

Global<CtrlClient>::Get()->ClearKV("plan");
if (GlobalProcessCtx::IsThisProcessMaster()) {
  // TODO(chengcheng): split plan for each rank.
  Global<CtrlClient>::Get()->PushKV("plan", plan_);
}
Contributor

Shouldn't this "plan" key be tied to the graph's name, since there will be multiple graphs?

Contributor Author

Actually, no. At any given time there is only one plan in the KV store, which is why we clear it first; prompt cleanup keeps plans from accumulating in the KV store. Later, though, the plan should be split by rank into multiple KV entries.
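
A hedged sketch of the per-rank split described above: rather than a single shared "plan" entry, the master would push one sub-plan per rank under a rank-qualified key. SubPlanForRank and the "plan:<rank>" key scheme are assumptions for illustration, not the actual OneFlow implementation.

// Assumed follow-up to the TODO above; SubPlanForRank is hypothetical.
if (GlobalProcessCtx::IsThisProcessMaster()) {
  for (int64_t rank = 0; rank < GlobalProcessCtx::WorldSize(); ++rank) {
    // Keep only the tasks placed on `rank`, then publish under its own key.
    Plan sub_plan = SubPlanForRank(plan_, rank);
    Global<CtrlClient>::Get()->PushKV("plan:" + std::to_string(rank), sub_plan);
  }
}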


Maybe<void> NNGraph::CompileAndRuntime() {
  JobBuildAndInferCtx* job_ctx = JUST(GetJobBuildAndInferCtx(name_));
  job_ = job_ctx->job();
Contributor

Where are the synchronization and validation of job_ done?

Contributor Author

No validation yet; we simply take the job on rank 0 as authoritative.

Contributor Author

@chengtbf chengtbf Jul 22, 2021

Eventually a check is needed here, but it will have to ignore certain information: for example, scope symbol ids may differ from rank to rank (the contents must match, but the symbols themselves cannot be synchronized). We may also need to serialize scopes to proto (a Scope proto) and store them in the job.

Contributor Author

Alternatively, remove the symbol id from the op conf and replace it with an ordinary scope-id-to-scope mapping. That way the job would no longer depend on the Python script that defined it; things like symbol ids are too tied to the execution context.
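
A rough sketch of the cross-rank check discussed in this thread, under the assumption that context-dependent fields are simply ignored: strip scope_symbol_id from every op conf, then compare the serialized job against rank 0's copy via the ctrl KV store. The "job_check" key, the helper name, and the overall flow are illustrative assumptions, not the merged implementation.

// Hedged sketch: normalize away per-rank symbol ids before comparing.
Maybe<void> CheckJobConsistencyAcrossRanks(const Job& job) {
  Job normalized(job);
  for (OperatorConf& op_conf : *normalized.mutable_net()->mutable_op()) {
    // Symbol ids can differ across ranks even when scope contents match.
    op_conf.clear_scope_symbol_id();
  }
  std::string serialized;
  normalized.SerializeToString(&serialized);
  if (GlobalProcessCtx::IsThisProcessMaster()) {
    Global<CtrlClient>::Get()->PushKV("job_check", serialized);
  } else {
    std::string master_serialized;
    Global<CtrlClient>::Get()->PullKV("job_check", &master_serialized);
    CHECK_EQ_OR_RETURN(serialized, master_serialized);
  }
  return Maybe<void>::Ok();
}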

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 22, 2021 08:38
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 142.7ms (= 7134.7ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.2ms (= 6312.4ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 142.7ms / 126.2ms)

PyTorch resnet50 time: 89.3ms (= 4466.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 73.8ms (= 3688.7ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.21 (= 89.3ms / 73.8ms)

PyTorch resnet50 time: 58.3ms (= 2912.9ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.9ms (= 2396.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 58.3ms / 47.9ms)

PyTorch resnet50 time: 48.3ms (= 2414.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 43.7ms (= 2186.9ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.10 (= 48.3ms / 43.7ms)

PyTorch resnet50 time: 39.6ms (= 1980.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 44.0ms (= 2199.3ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.90 (= 39.6ms / 44.0ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 22, 2021 11:02
@oneflow-ci-bot oneflow-ci-bot merged commit 47e9e66 into master Jul 22, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_NNGraph_compile_runtime branch July 22, 2021 11:03