[Core] Extend chaos testing utility to more closely replicate spot instance preemptions #46367
Open
Description
We're running a large-scale batch inference job on spot instances and trying to use as many GPUs as possible. We've observed this pattern for preemptions:
To more closely replicate this pattern, it'd be helpful if the chaos testing utility could:
- Kill many nodes at once (e.g., ~200 above)
- Prevent the cluster from scaling back up (to mimic spot instances being unavailable)
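As a rough illustration of the two requested behaviors, here is a minimal, self-contained sketch in plain Python. It does not use any Ray API: `SimulatedCluster`, `preempt_batch`, and the `scale_up_blocked` flag are hypothetical names invented for this example, and a real implementation would need to hook into the actual chaos testing utility and autoscaler.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedCluster:
    """Toy stand-in for a cluster (hypothetical; not a Ray class)."""

    nodes: set = field(default_factory=set)
    scale_up_blocked: bool = False

    def request_scale_up(self, new_nodes):
        """Try to add nodes; return how many were actually added."""
        # While blocked, every scale-up request is refused, mimicking
        # spot capacity being unavailable after a preemption wave.
        if self.scale_up_blocked:
            return 0
        added = set(new_nodes) - self.nodes
        self.nodes |= added
        return len(added)


def preempt_batch(cluster, kill_count, seed=None):
    """Remove up to `kill_count` nodes in one step and block scale-up."""
    rng = random.Random(seed)
    # Sort first so that a fixed seed gives a reproducible choice.
    candidates = sorted(cluster.nodes)
    victims = rng.sample(candidates, min(kill_count, len(candidates)))
    cluster.nodes -= set(victims)
    cluster.scale_up_blocked = True
    return victims


cluster = SimulatedCluster(nodes={f"node-{i}" for i in range(500)})
killed = preempt_batch(cluster, kill_count=200, seed=0)
refused = cluster.request_scale_up(killed)  # refused while blocked
```

In this sketch, one call removes the whole batch at once, and the cluster stays at the reduced size until `scale_up_blocked` is cleared, which is the behavior the two bullets above ask for.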
Use case
See above.