multi client launch #5372
Signed-off-by: daquexian <daquexian566@gmail.com>
…f_multi_devices Signed-off-by: daquexian <daquexian566@gmail.com>
…/oneflow into new_if_multi_devices
…new_if_multi_devices
oneflow/init.py
@@ -69,7 +69,7 @@
if env_util.HasAllMultiClientEnvVars():
    env_util.env_init(True)
    env_util.api_env_init()
This looks wrong, doesn't it?
Right, I changed it by mistake.
@@ -33,4 +33,7 @@ ONEFLOW_API_PYBIND11_MODULE("", m) {
  m.def("GetRank", &GetRank);
  m.def("GetWorldSize", &GetWorldSize);
  m.def("GetNodeSize", &GetNodeSize);
  m.def("GetLocalRank", &GetLocalRank);
  m.def("IsMultiClient", &IsMultiClient);
Is this how it is meant to be used on the Python side?
import oneflow._oneflow_internal
oneflow._oneflow_internal.IsMultiClient()
@@ -203,3 +208,8 @@ def get_world_size():
    """
    return oneflow._oneflow_internal.GetWorldSize()


@oneflow_export("distributed.is_multi_client")
Oh, I see it here now.
@@ -388,6 +384,33 @@ def GetEnvDefaultParallelConf(device_tag):
    return device_tag2default_parallel_conf[device_tag]


def HasAllMultiClientEnvVars():
    return (
        os.getenv("MASTER_ADDR")
This condition looks wrong. getenv returns the corresponding string value, but the result of the and chain is 0. Tested:
import os
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12139"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
is_multi_client = (os.getenv("MASTER_ADDR") and os.getenv("MASTER_PORT")
                   and os.getenv("WORLD_SIZE") and os.getenv("RANK") and os.getenv("LOCAL_RANK"))
print("is_multi_client", is_multi_client)
Output:
is_multi_client 0
I tried it out. This 0 is the string "0", the value of the last operand (os.getenv("LOCAL_RANK")), so if is_multi_client still takes the True branch. But HasAllMultiClientEnvVars() should indeed return True/False; I'll fix it.
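The behaviour discussed above can be reproduced without touching the real environment. The following is only a minimal sketch: the helper name has_all_multi_client_env_vars, the MULTI_CLIENT_ENV_VARS tuple, and the plain dict standing in for os.environ are assumptions made for illustration, not the code that was merged in this PR.

```python
import os

# Variable names taken from the test snippet in the review comment above.
MULTI_CLIENT_ENV_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK")


def has_all_multi_client_env_vars(environ=os.environ):
    """Hypothetical helper: True only if every variable is set and non-empty."""
    # all() always returns a real bool, unlike a chained `and` expression.
    return all(environ.get(name) for name in MULTI_CLIENT_ENV_VARS)


# A plain dict stands in for os.environ so the example has no side effects.
env = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "12139",
    "WORLD_SIZE": "1",
    "RANK": "0",
    "LOCAL_RANK": "0",
}

# When every operand is truthy, a chained `and` evaluates to its last operand,
# so this is the non-empty (and therefore truthy) string "0", not a bool.
chained = (env.get("MASTER_ADDR") and env.get("MASTER_PORT")
           and env.get("WORLD_SIZE") and env.get("RANK") and env.get("LOCAL_RANK"))

# Dropping one variable makes the helper return False.
incomplete_env = {k: v for k, v in env.items() if k != "RANK"}

print(repr(chained), bool(chained))
print(has_all_multi_client_env_vars(env), has_all_multi_client_env_vars(incomplete_env))
```

Wrapping the check in all() (or bool()) keeps the branch behaviour unchanged while making the return type unambiguous for callers that print or compare the value.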
The parts of #5008 that relate to multi client itself. Some of the changes come from binbin.