
Add Fake TPU e2e Autoscaling Test Cases #2279

Open · ryanaoleary wants to merge 11 commits into base: master

Conversation

@ryanaoleary (Contributor) commented Jul 31, 2024

Why are these changes needed?

This PR adds a fake TPU test case, similar to the existing fake GPU test case for autoscaling, that uses detached actors to verify that single-host autoscaling behaves as expected. The behaviors tested include:

  • (1) Creating a detached actor that requests resources: {"TPU": 4} will scale up a Ray TPU worker (see the sketch below)
  • (2) For a worker group, the number of workers created should equal replicas * numOfHosts (e.g., 2 replicas with numOfHosts: 2 yields 4 worker Pods)
  • (3) Terminating detached actors scheduled on a TPU worker group replica will cause the entire replica to be scaled down

Edit: Removed the test behavior for idle nodes being scaled down, since KubeRay TPU Pods are unschedulable due to their required nodeSelectors.
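As a rough illustration of behavior (1), the pattern boils down to creating a detached actor that reserves the fake TPU resource. This is a minimal sketch using the plain Ray Python API; the actor class, actor name, and namespace are illustrative and not taken from the test scripts:

```python
import ray

# Connect to the running RayCluster; the namespace is only needed so the
# detached actor can later be looked up by name.
ray.init(address="auto", namespace="e2e")

@ray.remote(resources={"TPU": 4})
class TPUActor:
    def ready(self):
        return True

# A detached actor is not tied to this driver's lifetime, so its {"TPU": 4}
# demand stays visible to the autoscaler until the actor is explicitly killed.
actor = TPUActor.options(name="tpu_actor", lifetime="detached").remote()
ray.get(actor.ready.remote())  # returns once a TPU worker has scaled up
```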

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

test.Expect(GetGroupPods(test, rayCluster, "tpu-group")).To(HaveLen(4))

// Terminating one TPU detached actor will result in the Ray node becoming idle, causing Ray to scale down the entire multi-host
// worker group. A new multi-host worker group will then be scaled back up since the remaining detached actors are running.
Collaborator:

This behavior seems a bit unexpected to me. What's the reason we expect a scale down and a scale up again in this scenario?

Collaborator:

I might just be misunderstanding this comment. Should there be an assertion for this part?

A new multi-host worker group will then be scaled back up since the remaining detached actors are running.

Contributor Author:

Detached actors keep running when the Ray node they're scheduled on is scaled down, so the autoscaler sees the outstanding request for TPUs and scales a multi-host worker group back up to meet the unmet demand. In a regular scenario (i.e., non-detached actors), the actors would be terminated along with their respective nodes when the replica scales down.
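To make this concrete: the detached actors in this test have to be terminated explicitly, roughly as in the hedged sketch below (standard Ray API; the actor name and namespace are illustrative). Once the actor is killed, its {"TPU": 4} demand disappears and the replica can stay scaled down.

```python
import ray

ray.init(address="auto", namespace="e2e")

# Killing the detached actor releases its {"TPU": 4} reservation; once the
# hosting Ray node goes idle, the autoscaler can scale the multi-host
# replica down, with nothing left to scale it back up.
actor = ray.get_actor("tpu_actor")
ray.kill(actor, no_restart=True)
```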

Contributor Author:

I can add an assertion that checks that the pod list length becomes 0 before becoming 4 again

Collaborator:

Ah I see, I missed the behavior specific to detached actors.

Collaborator:

I can add an assertion that checks that the pod list length becomes 0 before becoming 4 again

sgtm!

Contributor Author:

I ended up removing this section in 0c6bb58, because getting the node to become idle requires setting the timeout to 5+ minutes, which I'd imagine would slow down the presubmit too much. The scale-down behavior for a multi-host replica is still tested by deleting the detached actors.

@ryanaoleary (Contributor Author):

cc: @kevin85421

@kevin85421 self-assigned this on Aug 12, 2024
@kevin85421 (Member):

I plan to include this PR in v1.3.0 instead.

Commits pushed (each signed off by Ryan O'Leary <ryanaoleary@google.com>), including:

Fix tests for updated format
@ryanaoleary (Contributor Author):

Related Issue: #2561

Additional commits pushed (signed off by Ryan O'Leary <ryanaoleary@google.com>).
@ryanaoleary (Contributor Author):

Based on offline discussion, I'm moving the multi-host test case to a separate PR. With the V2 autoscaler only, I'm seeing issues with maxReplicas being enforced (the V1 test case passes, but V2 never deletes the Pod); I'll open an issue with reproduction steps to see if I can root-cause it.

--- FAIL: TestRayClusterAutoscalerWithFakeSingleHostTPU (344.39s)
[2025-01-23T18:58:30Z]     --- PASS: TestRayClusterAutoscalerWithFakeSingleHostTPU/Create_a_RayCluster_with_autoscaling_enabled (26.63s)
[2025-01-23T18:58:30Z]     --- FAIL: TestRayClusterAutoscalerWithFakeSingleHostTPU/Create_a_RayCluster_with_autoscaler_v2_enabled (317.64s)
