Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix split config proto and session job set #3637

Merged

Conversation

lixinqi
Copy link
Contributor

@lixinqi lixinqi commented Oct 3, 2020

InitGlobalSession这个接口在eager模式下不应该被调用,因为它只是为lazy准备的。

@@ -58,9 +58,7 @@ def InterpretScope(session, function_desc, config_proto):
job_conf.job_name = function_desc.job_func.__name__
placement_scope = function_desc.function_attribute.default_placement_scope
if placement_scope is None:
tag_and_dev_ids = placement_util.GetDefaultMachineDeviceIds(
oneflow.env.current_resource()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

原来的代码肯定是不对的。没有出错的原因是lazy的情形下env.current_resource就是返回Global<Resource, ForSession>,但在eager下,Global<Resource, ForSession>并没有得到特别的维护。

Comment on lines -89 to -90
#define OF_BARRIER_ALL() Global<CtrlClient>::Get()->Barrier(FILE_LINE_STR)
#define OF_BARRIER() \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

重命名使得语义更佳

@@ -114,31 +114,31 @@ def IsSessionInited():
return oneflow_internal.IsSessionInited()


def InitGlobalSession(config_proto):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

重命名相关api,使之不滥用

@chengtbf
Copy link
Contributor

chengtbf commented Oct 3, 2020

我们最上层的API、抽象 定义混乱,且有耦合,之后需要好好梳理一下

Env、Session(应该有专门的C++ Session概念)、Resource、MachineCtx、GlobalXXXScope ... 等等概念,有哪些是分布式环境最原始的信息,需要在OneFlow分布式环境启动时就该创建的(EnvScope & Ctrl & Transport);有哪些是随着每次 job编译/ Eager vm等生命周期启动再创建 (Session / Complier / Runtime ...); 做好概念的分级 和 各种Global的new/delete 生命周期管理。

@@ -34,7 +34,6 @@ class JobBuildAndInferCtxMgr {
Maybe<JobBuildAndInferCtx*> FindJobBuildAndInferCtx(const std::string& job_name);
Maybe<std::string> GetCurrentJobName() const;
Maybe<void> CloseCurrentJobBuildAndInferCtx();
Maybe<void> AddLbiAndDiffWatcherUuidPair(const LbiAndDiffWatcherUuidPair& lbi_uuid_pair) const;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个函数放在JobBuildAndInferCtxMgr是不合适的。应该属于JobBuildAndInferCtx的接口

@@ -89,7 +88,6 @@ Maybe<void> SessionGlobalObjectsScope::Init(const ConfigProto& config_proto) {
Global<CriticalSectionDesc>::New();
Global<InterUserJobInfo>::New();
Global<LazyJobBuildAndInferCtxMgr>::New();
Global<LbiDiffWatcherInfo>::New();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个Global也是非必须的。

@@ -78,7 +78,7 @@ EpollCommNet::~EpollCommNet() {
LOG(INFO) << "CommNet Thread " << i << " finish";
pollers_[i]->Stop();
}
OF_BARRIER();
OF_SESSION_BARRIER();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

epoll_comm_network 在重构,让它和job, plan 等无关,也应该是和session无关的

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个宏的实现用到了Global<ResourceDesc, ForSession>,它目前就是和Session相关。当然它有问题,重构的时候必须把它改过来

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里加个注释解释一下,再加个TODO()

@@ -66,7 +66,7 @@ void IBVerbsCommNet::RegisterMemoryDone() {
.second);
}
}
OF_BARRIER();
OF_SESSION_BARRIER();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里也是

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CommNetwork未来要重构成ENV级别的,跟CtrlServer和CtrlClient同一级别,在它们初始化之后就可以初始化。

@oneflow-ci-bot oneflow-ci-bot merged commit 4d3e2d3 into master Oct 4, 2020
@oneflow-ci-bot oneflow-ci-bot deleted the bugfix_split_config_proto_and_session_job_set branch October 4, 2020 05:39
liujuncheng pushed a commit that referenced this pull request Jun 3, 2021
* rename OF_BARRIAER

* add eager_2node_test.py

* InitLazyGlobalSession if eager execution not enabled

* remove Global<LbiDiffWatcherInfo>

* add TODO() comments for OF_SESSION_BARRIER under directory core/comm_network/

* remove eager_2node_test.py

Former-commit-id: 4d3e2d3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants