Skip to content

[Core] Extend chaos testing utility to more closely replicate spot instance preemptions #46367

Open
@bveeramani

Description

Description

We're running a large-scale batch inference job on spot instances and trying to use as many GPUs as possible. We've observed this pattern for preemptions:
image

To more closely replicate this pattern, it'd be helpful if the chaos testing utility could:

  1. Kill many nodes at once (e.g., ~200 above)
  2. Prevent the cluster from scaling back up (to mimic spot instances being unavailable)

Use case

See above.

Metadata

Assignees

Labels

P1Issue that should be fixed within a few weekscoreIssues that should be addressed in Ray CoreenhancementRequest for new feature and/or capability

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions