Make elastic timeout configurable for HorovodJob. #2631

Merged
merged 3 commits into from
Aug 8, 2024

Conversation

supercharleszhu
Contributor

@supercharleszhu supercharleszhu commented Jul 31, 2024

Tracking issue

Why are the changes needed?

The default elastic timeout for waiting for worker pods to become ready is 60s: https://github.com/horovod/horovod/blob/55a73d4f662bf637f6994a61f62b517ad9554dc1/horovod/runner/launch.py#L452C32-L452C51

This PR makes it configurable.

What changes were proposed in this pull request?

Changes:

  • Increase the default elastic_timeout to 20 min.
  • Make elastic_timeout configurable by the user.
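
For illustration, the idea behind both changes can be sketched roughly as below. This is a minimal, self-contained sketch and not the plugin's actual code: `ElasticJobConfig`, `build_horovodrun_command`, and the default discovery script path are hypothetical names chosen for the example. The only parts taken from this PR are the `elastic_timeout` setting and its 20 min (1200 s) default, which the sketch forwards to `horovodrun` via the `--elastic-timeout` flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElasticJobConfig:
    """Hypothetical stand-in for the job config; not the plugin's real class."""

    num_workers: int = 1
    # Assumed path for the example; the real location depends on the deployment.
    discovery_script_path: str = "/etc/mpi/discover_hosts.sh"
    # Seconds to wait for workers during elastic startup; 1200 s = 20 min.
    elastic_timeout: int = 1200

    def __post_init__(self) -> None:
        # Reject zero or negative values early instead of failing at launch.
        if self.elastic_timeout <= 0:
            raise ValueError("elastic_timeout must be a positive number of seconds")


def build_horovodrun_command(cfg: ElasticJobConfig, user_cmd: List[str]) -> List[str]:
    """Assemble a horovodrun invocation that forwards the configured timeout."""
    return [
        "horovodrun",
        "-np", str(cfg.num_workers),
        "--host-discovery-script", cfg.discovery_script_path,
        "--elastic-timeout", str(cfg.elastic_timeout),
        *user_cmd,
    ]


# Example: a user who needs more than the default asks for 30 minutes.
cfg = ElasticJobConfig(num_workers=4, elastic_timeout=1800)
print(" ".join(build_horovodrun_command(cfg, ["python", "train.py"])))
```

Keeping the timeout as a plain field with a default means existing task definitions keep working unchanged, while clusters with slow pod scheduling can raise the value per task.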

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

  • [ X ] I updated the documentation accordingly.
  • [ X ] All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: Chen Zhu <chzhu@linkedin.com>
Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
ByronHsu
ByronHsu previously approved these changes Jul 31, 2024
Collaborator

@ByronHsu ByronHsu left a comment

Collaborator

@eapolinario eapolinario left a comment


Fix the typo and we're golden.

plugins/flytekit-kf-mpi/flytekitplugins/kfmpi/task.py (outdated, resolved)
Member

@Future-Outlier Future-Outlier left a comment


@supercharleszhu
Hi, before this is merged, can you provide a screenshot of a remote execution that shows this branch's commit hash?

This is my example: #1870

Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
@supercharleszhu
Contributor Author

Updated. @eapolinario

@supercharleszhu Hi, before this is merged, can you provide a screenshot of a remote execution that shows this branch's commit hash?

This is my example: #1870

@Future-Outlier We are from LinkedIn, and this change was committed and tested in our internal branch, so the commit hash is not in sync with OSS. Does a screenshot like this work? (Some internal settings are masked.)

[Screenshot: WX20240807-203835@2x]


codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.86%. Comparing base (085fa9c) to head (f77e25c).
Report is 28 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2631       +/-   ##
===========================================
+ Coverage   41.91%   78.86%   +36.94%     
===========================================
  Files         188      187        -1     
  Lines       19037    19169      +132     
  Branches     3715     4002      +287     
===========================================
+ Hits         7980    15118     +7138     
+ Misses      10948     3353     -7595     
- Partials      109      698      +589     


@pingsutw pingsutw merged commit aadfc49 into flyteorg:master Aug 8, 2024
101 checks passed

welcome bot commented Aug 8, 2024

Congrats on merging your first pull request! 🎉

mao3267 pushed a commit to mao3267/flytekit that referenced this pull request Aug 9, 2024
Signed-off-by: Chen Zhu <chzhu@linkedin.com>
Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
Signed-off-by: mao3267 <chenvincent610@gmail.com>
Mecoli1219 pushed a commit to Mecoli1219/flytekit that referenced this pull request Aug 14, 2024
Signed-off-by: Chen Zhu <chzhu@linkedin.com>
Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
Mecoli1219 pushed a commit to Mecoli1219/flytekit that referenced this pull request Aug 14, 2024
Signed-off-by: Chen Zhu <chzhu@linkedin.com>
Signed-off-by: Chen Zhu <zhuchen1033@gmail.com>
Signed-off-by: Mecoli1219 <michaellai901026@gmail.com>