port and shell #74

Merged
merged 5 commits into from
May 25, 2023
Conversation

Collaborator

@shh2000 shh2000 commented May 17, 2023

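  1. Add MASTER_PORT to the cluster_config file, defaulting to 29501 (kept in sync with start_pytorch_task). While adding some cases we ran into apparent hangs, possibly timeouts caused by the PORT being unavailable. Because the c10d default timeout is 00:30:00, the failure looks like a hang. Changing MASTER_PORT may fix it.

  2. Add master-port handling to run.dev.

  3. Before prepending source xx.env, run.dev now checks whether the file exists. If it does not, nothing is prepended. If it does, the source command is prepended and its stdout/stderr are redirected to log_path/source_env.log.txt. The source command and python3 startxxx are now joined with && instead of ;. If a task starts successfully but nothing appears in the log, check source_env.log.txt, which contains the stderr output.

  4. Sync the changes above to dev.py.

  5. The master-port flow has been tested.

  6. Tested with no env file, a valid env file, and an env file containing syntax errors.

  7. All of the tests above were run on the faster-rcnn stdcase. A test counts as passed if training can begin the first epoch.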
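
Since item 1 attributes the apparent hang to an unavailable port, a quick pre-flight check can help rule that out before waiting on the c10d timeout. The following is only a minimal sketch and not part of this PR: `is_port_free` is a hypothetical helper, and it only tests whether a TCP port can be bound on the machine where it runs (it would need to run on the master node, and the port could still be taken between the check and the launch).

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration only; not a FlagPerf function.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


if __name__ == "__main__":
    master_port = 29501  # default MASTER_PORT named in this PR
    state = "free" if is_port_free(master_port) else "in use"
    print(f"port {master_port} is {state}")
```

If the port is reported as in use, setting a different MASTER_PORT in cluster_config is the workaround suggested above.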

@shh2000 shh2000 reopened this May 19, 2023
if (os.path.isfile(env_file)):
    start_cmd = "cd " + dp_path + " && " + sys.executable \
        + " utils/container_manager.py -o runcmdin -c " \
        + container_name + " -d -r \"source " + env_file \
Contributor

start_task_cmd also contains source env_file

Collaborator Author

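Right. These environment variables have not been set up before entering the container, so start_task_cmd (the manual command) needs to source env first and then run python3 startxx.py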

Collaborator Author

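dev.py has been updated, and the later start_task_cmd now gets the same treatment. However, when an env file exists for start_task_cmd, no redirection is needed; the only change is ; to &&. This command is typed manually inside the container, so any error is visible right away and execution stops.
Also, run.py has no changes. start_task_cmd is specific to dev.py, where it is used to print the manual command; run.py has no related logic.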
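
For reference, the two variants discussed in this thread (redirected inside the container, not redirected for the manual command) can be sketched roughly as below. This is an illustration under assumptions, not the code merged in this PR: `build_start_cmd` and its parameters are hypothetical names, and only the file name source_env.log.txt and the `&&` join are taken from the description.

```python
import os
import shlex


def build_start_cmd(env_file, start_cmd, log_dir=None):
    """Prefix start_cmd with `source env_file` only when the file exists.

    Hypothetical helper for illustration only.
    log_dir=None models the manual start_task_cmd (no redirection);
    a directory models the in-container command, whose source output
    goes to <log_dir>/source_env.log.txt.
    """
    if not os.path.isfile(env_file):
        # No env file: run the start command on its own.
        return start_cmd
    source = "source " + shlex.quote(env_file)
    if log_dir is not None:
        log_file = os.path.join(log_dir, "source_env.log.txt")
        # Capture both stdout and stderr of the source step.
        source += " > " + shlex.quote(log_file) + " 2>&1"
    # `&&` (rather than `;`) skips start_cmd when sourcing fails.
    return source + " && " + start_cmd
```

With `;` the start command would run even after a failed source, which is why a broken env file could previously go unnoticed.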

@upvenly upvenly merged commit 2b3289a into FlagOpen:main May 25, 2023
4 participants