-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make elastic timeout configurable for HorovovJob. #2631
Conversation
Signed-off-by: Chen Zhu <chzhu@linkedin.com>
86d224c
to
803790b
Compare
Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. cc @Future-Outlier @pingsutw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the typo and we're golden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@supercharleszhu
Hi, before this is merge, can you provide screenshot with this branch's commit hash at remote execution?
this is my example: #1870
d274346
to
e01f0d2
Compare
Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
e01f0d2
to
f77e25c
Compare
updated. @eapolinario
@Future-Outlier we are from LinkedIn and committed and tested in our internal branch. The commit hash is not sync with OSS. Let me know screenshot like this make sense or not? (masked some of the internal settings) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2631 +/- ##
===========================================
+ Coverage 41.91% 78.86% +36.94%
===========================================
Files 188 187 -1
Lines 19037 19169 +132
Branches 3715 4002 +287
===========================================
+ Hits 7980 15118 +7138
+ Misses 10948 3353 -7595
- Partials 109 698 +589 ☔ View full report in Codecov by Sentry. |
Congrats on merging your first pull request! 🎉 |
Signed-off-by: Chen Zhu <chzhu@linkedin.com> Signed-off-by: Chen Zhu <zhuchen1033@gmail.com> Signed-off-by: mao3267 <chenvincent610@gmail.com>
Signed-off-by: Chen Zhu <chzhu@linkedin.com> Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
Signed-off-by: Chen Zhu <chzhu@linkedin.com> Signed-off-by: Chen Zhu <zhuchen1033@gmail.com> Signed-off-by: Mecoli1219 <michaellai901026@gmail.com>
Tracking issue
Why are the changes needed?
The default elastic timeout to wait for worker pod to be ready is 60s https://github.com/horovod/horovod/blob/55a73d4f662bf637f6994a61f62b517ad9554dc1/horovod/runner/launch.py#L452C32-L452C51
This PR is to make it configurable
What changes were proposed in this pull request?
Changes:
How was this patch tested?
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link