Doc: gang-scheduling for kubeflow training-operator #3851

Merged

fg91
Member

@fg91 fg91 commented Jul 8, 2023

Describe your changes

In the Flyte Slack community, there have recently been multiple questions about how to avoid the timeout errors that occur when the workers of a distributed training job don't start at the same time due to resource constraints.

The best solution to this problem is to use a Kubernetes scheduler that supports gang scheduling. In this PR, I add documentation on how to configure this.

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 force-pushed the fg91/doc/kubeflow-pluging-gang-sched branch from 8600dcb to 9432c7e Compare July 8, 2023 10:09
@fg91 fg91 requested review from kumare3 and samhita-alla July 8, 2023 10:10
Contributor

@samhita-alla samhita-alla left a comment

Thanks, @fg91!

@samhita-alla samhita-alla merged commit 49868b6 into flyteorg:master Jul 10, 2023
6 checks passed
@kumare3
Contributor

kumare3 commented Jul 11, 2023

You are a rockstar @fg91

3 participants