Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fea/nn graph/graph build ctx #5420

Merged
merged 76 commits into from
Jul 13, 2021
Merged

Conversation

strint
Copy link
Contributor

@strint strint commented Jul 7, 2021

Context for graph build in nn.Graph._compile()

  • LazyMode::is_enabled() and LazyMode::Gard()
  • LazyMode controls interpreter and job_build_and_infer_ctx_mgr
  • MultiClient execution decorator and init MutiClientSession
  • job_build_and_infer_context enter & exit
  • global scope
  • 下个pr module scope
  • 下个pr parameter & buffer scope

@strint strint mentioned this pull request Jul 13, 2021
6 tasks
@@ -160,18 +160,18 @@ Maybe<AttrVal> MakeCppAttrValueFromProtoOrCfgAttrValue(const ProtoT& cfg_attr_va
}
}

/*static*/ Maybe<AttrVal> AttrValueUtil::ToCppAttrValue(const AttrValue& proto_attr_value) {
/* static */ Maybe<AttrVal> AttrValueUtil::ToCppAttrValue(const AttrValue& proto_attr_value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为啥有这么多的格式调整?之前的 pr 没有 format 这里吗?现在 format 不是自动的吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种注释format没有检查,代码里面有两种风格,统一到了和tf一样的风格

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种注释不适合放在这个位置吧,只见过放在函数参数边上的,用来提示这个参数是哪个,比如:

ToCppAttrValue(/* proto_attr_value */ attr)

函数开头写个 /* static */ 看不明白是要表达什么

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个是google的代码风格,静态成员函数的实现在函数开头加注释/* static */,表明这个函数是类的静态成员函数

@@ -49,6 +54,10 @@ def name(self):
def training(self):
return self.config.training

@property
def _graph_proto(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 property 不对外暴露吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

暂时只是为了我们自己开发调试,没想好是否要开放,所以暂时写成私有的

session.TryInit()

with graph_build_util.graph_build_context(self.config.proto, session):
outputs = self.build(*args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里掉用 build 前后是不是有一些 todo

  • parameter block 转换成 variable op
  • input eager tensor 转换成 input op 并返回 lazy tensor

还有这里返回的 outputs 后面用的着吗?

Copy link
Contributor Author

@strint strint Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里掉用 build 前后是不是有一些 todo

  • parameter block 转换成 variable op
  • input eager tensor 转换成 input op 并返回 lazy tensor

没错,要依赖成诚的pr补全这块,所以这里都还没处理,只是为其准备好执行的graph_build_context

还有这里返回的 outputs 后面用的着吗?

按现在的设计,用得着,返回的lazy tensor,会再走lazy interpreter运行一遍,添加output op到 job

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs 会被转换成 eager tensor,这里仍然需要构造 FetchOutputOpExpr 来实现。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs 转换成 eager tensor 是 launch 后做的,不是 compile 这里做的?

oneflow/init.py Outdated
@@ -75,6 +76,7 @@
session_ctx.OpenDefaultSession(
MultiClientSession(oneflow._oneflow_internal.NewSessionId())
)
scope_util.InitScopeStack()
Copy link
Contributor Author

@strint strint Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个init.py文件调试时,改动始终没有生效。。。

解决方式是要make 一下,把它copy到build目录,以同步改动,其它python文件不用。原因参见oneflow.cmake里面的python文件的copy和软链接这两种处理方式。 from @caishenghang

这里的改动不是本次pr的,为了解决冲突,先合并了 #5469

@chengtbf
Copy link
Contributor

这里是不是也可以close + open MultiClientSession。各个测试case之间隔离开,会更合理些

内部的 Close + Open 未来也可以提供,前提是 Multi-ClientSession 还需要记录当前 Session 中仍然存在的 NNGraph 对象,只有NNGraph 对象都下线了,才能 Close。 目前我自己觉得即使不隔离,也不会影响单测。

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 13, 2021 12:40
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 13, 2021 13:24
@oneflow-ci-bot oneflow-ci-bot merged commit 67272d2 into master Jul 13, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fea/nn_graph/graph_build_ctx branch July 13, 2021 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants