
[Bug] Allow zero replica for workers for Helm #968

Merged 2 commits into ray-project:master on Jun 12, 2023

Conversation

@ducviet00 (Contributor) commented Mar 16, 2023

Why are these changes needed?

We are currently using Ray for compute-heavy tasks on GKE. When the cluster initializes, it spawns one worker for each worker group, which triggers a GKE node scale-up and costs money.

This happens because of the default function in the template file: {{ 0 | default 1 }} evaluates to 1, so an explicit zero is silently replaced by the fallback.

minReplicas: {{ $values.minReplicas | default (default 1 $values.miniReplicas) }}

minReplicas: {{ .Values.worker.minReplicas | default (default 1 .Values.worker.miniReplicas) }}

The workaround is to set the default replica count to zero, so the fallback value is also 0 when default kicks in.
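A minimal sketch of why an explicit zero gets overridden and why a zero default fixes it (this illustrates the behavior of Helm's default function from Sprig; the template line is the one quoted above):

# Sprig's default treats 0, "", and nil as "empty", so the fallback is used
# even when the user explicitly sets 0:
#   {{ 0 | default 1 }}   renders as 1
#   {{ 0 | default 0 }}   renders as 0
#
# With the quoted line
#   minReplicas: {{ .Values.worker.minReplicas | default (default 1 .Values.worker.miniReplicas) }}
# a user-supplied worker.minReplicas: 0 therefore falls through to the inner
# default of 1. If the chart's fallback is 0 instead, the rendered manifest
# keeps minReplicas: 0 and no worker pod is created at install time.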

Related issue number

Open #965

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 (Member) left a comment

Thank you for the contribution! I am wondering whether there is any difference for your use case between disabled: true and replicas: 0.

@ducviet00 (Contributor, Author) commented Mar 17, 2023

> Thank you for the contribution! I am wondering whether there is any difference for your use case between disabled: true and replicas: 0.

As I understand it, minReplicas: 0 allows scaling worker pods down to zero, while disabled: true doesn't allow the group to scale up pods at all.

@ducviet00 (Contributor, Author) commented Mar 17, 2023

I think setting replicas: 0 as the default is better than replicas: 1 because we shouldn't create a worker replica initially. A worker requests a lot of memory, so an idle replica is a waste of resources. The autoscaler will add workers based on each job's resource demand, and setting minReplicas: 0 allows that as well as scaling down to zero when no job is running.
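A sketch of the worker-group values this enables, assuming the chart's worker keys (field names are illustrative; check the chart's values.yaml for the exact ones):

# values.yaml excerpt (illustrative)
worker:
  replicas: 0      # no worker pod at install time
  minReplicas: 0   # the autoscaler may scale this group down to zero
  maxReplicas: 3   # upper bound when jobs request resources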

@ducviet00 (Contributor, Author) commented:

@kevin85421 Could you review this?

@kevin85421 kevin85421 self-requested a review May 30, 2023 20:19
@kevin85421 kevin85421 self-assigned this May 30, 2023
@kevin85421 (Member) left a comment

Test this PR manually using this gist.

# Step 0: Replace values.yaml with the gist
# (path: helm-chart/ray-cluster)
helm install ray-cluster .

# Step 1: Try to scale up the cluster
export HEAD_POD=$(kubectl get pods -o custom-columns=POD:metadata.name | grep raycluster-autoscaler-head)
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=4)"

# Step 2: The RayCluster will scale from 0 workers to 3 workers.
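An optional follow-up check (not part of the test steps above, and assuming the autoscaler's default idle timeout): clear the resource request and confirm that the workers scale back down to zero.

# Step 3 (optional): clear the request; after the idle timeout only the head pod should remain.
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=0)"
kubectl get pods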

@kevin85421 kevin85421 merged commit 7ad3acf into ray-project:master Jun 12, 2023
@kevin85421 kevin85421 mentioned this pull request Jun 22, 2023
@yc2984 commented Jul 21, 2023

@kevin85421 is this available on 0.5.2?

@yc2984 commented Jul 21, 2023

> @kevin85421 is this available on 0.5.2?

I see it's only on 0.6.0. Is it stable or still WIP?

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023