NNGraph interface and implement for CompileAndRuntime #5558
Global<CtrlClient>::Get()->ClearKV("plan");
if (GlobalProcessCtx::IsThisProcessMaster()) {
  // TODO(chengcheng): split plan for each rank.
  Global<CtrlClient>::Get()->PushKV("plan", plan_);
Shouldn't the name of this plan key be tied to the graph's name, since there will be multiple graphs?
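Actually, that isn't necessary. Only one plan exists in the KV store at any given time, which is why the code above clears the key first; cleaning up promptly avoids plans piling up in the KV store. Later on, though, the plan should be split by rank and stored under multiple KV entries.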
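To make the clear-then-push pattern concrete, here is a minimal sketch. `FakeKvStore`, `PublishPlan`, and `PlanKeyForRank` are hypothetical names used only for illustration; the store is an in-memory stand-in, not OneFlow's `CtrlClient`, and the `"plan/<rank>"` key scheme is just one assumed way the per-rank split could look.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the control-plane KV store.
// This is NOT OneFlow's CtrlClient; it only mimics ClearKV/PushKV.
class FakeKvStore {
 public:
  void ClearKV(const std::string& key) { kv_.erase(key); }
  void PushKV(const std::string& key, const std::string& value) { kv_[key] = value; }
  const std::string& PullKV(const std::string& key) const { return kv_.at(key); }
  std::size_t size() const { return kv_.size(); }

 private:
  std::map<std::string, std::string> kv_;
};

// Assumed key scheme for a future per-rank split of the plan.
std::string PlanKeyForRank(int rank) {
  assert(rank >= 0);
  return "plan/" + std::to_string(rank);
}

// Clear any stale entry before pushing, so each key holds at most one plan.
void PublishPlan(FakeKvStore* kv, const std::string& key,
                 const std::string& serialized_plan) {
  kv->ClearKV(key);
  kv->PushKV(key, serialized_plan);
}
```

Calling `PublishPlan` twice with the key `"plan"` leaves only the second plan in the store, which is the property the reply relies on when it says the key does not need to carry the graph's name.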
Maybe<void> NNGraph::CompileAndRuntime() {
  JobBuildAndInferCtx* job_ctx = JUST(GetJobBuildAndInferCtx(name_));
  job_ = job_ctx->job();
Where are the synchronization and checking of job_ done?
No check is done yet. The job on rank 0 is simply treated as authoritative.
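Eventually a check does need to happen here, but it has to ignore some information. For example, the scope symbol id may differ across ranks (the content must be the same, but symbols cannot be synchronized). We may also need to serialize the scope symbol to proto (a proto for scope) and store it in the job.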
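Alternatively, remove the symbol id from the op conf and replace it with a design such as a plain scope-id-to-scope mapping. That way the job no longer depends on the Python script that defined it; something like a symbol id depends too heavily on the execution context.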
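The idea in the two comments above can be sketched roughly as follows: if the job carries its own scope-id-to-scope table, a cross-rank check can compare the scope content each op refers to and ignore the id values themselves. All types and fields here (`ScopeDesc`, `OpDesc`, `JobDesc`, `JobsEquivalent`, `device_tag`, `parallel_num`) are hypothetical simplifications for illustration, not OneFlow's actual proto definitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily simplified scope content.
struct ScopeDesc {
  std::string device_tag;
  int64_t parallel_num;
  bool operator==(const ScopeDesc& other) const {
    return device_tag == other.device_tag && parallel_num == other.parallel_num;
  }
};

// Hypothetical op entry: refers to its scope by a job-local id.
struct OpDesc {
  std::string name;
  int64_t scope_id;
};

// Hypothetical job that stores the scope table itself, so it does not
// depend on symbol ids from the process that built it.
struct JobDesc {
  std::vector<OpDesc> ops;
  std::map<int64_t, ScopeDesc> scope_id2scope;
};

// Two jobs are treated as equivalent when ops match in order and name and
// each op resolves to the same scope content; the id values may differ.
bool JobsEquivalent(const JobDesc& a, const JobDesc& b) {
  if (a.ops.size() != b.ops.size()) { return false; }
  for (std::size_t i = 0; i < a.ops.size(); ++i) {
    if (a.ops[i].name != b.ops[i].name) { return false; }
    auto scope_a = a.scope_id2scope.find(a.ops[i].scope_id);
    auto scope_b = b.scope_id2scope.find(b.ops[i].scope_id);
    if (scope_a == a.scope_id2scope.end() || scope_b == b.scope_id2scope.end()) {
      return false;  // Dangling scope id: treat as a mismatch.
    }
    if (!(scope_a->second == scope_b->second)) { return false; }
  }
  return true;
}
```

With this shape, a job built on rank 0 with scope id 7 and a job built on rank 1 with scope id 42 compare as equivalent as long as both ids resolve to the same scope content.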
Provides the NNGraph interface and implementation:
TODO:
Support a multi-job version.