GCP broken pipe error #219
FWIW, I tried adding 50 more nodes to this cluster and it managed to add 22 more nodes before failing like this: [logs]
In the end, I have 67 workers (and 67 worker VMs). Update: this second set of errors can be ignored; it is happening because I'm hitting IP quota limits at 67 nodes. The problem in the OP remains, though.
Do you think this is due to a timeout issue?
It could be, since there was a delay between when I created the cluster and when I tried to scale it up, somewhat like in #179. In this case it worked 49/50 times, whereas in #179 no new workers were created at all. The stack traces look to be about the same, so I'm not sure why the fix for that issue didn't work here if it's the same problem.
I got this again today after creating an adaptive cluster and then almost immediately (within 5 minutes) having it scale up by running a large number of tasks. It's making adaptive clusters pretty much unusable because after this error they stop scaling up or down. Perhaps …
I expect this is due to a race condition. We create an instance (dask-cloudprovider/dask_cloudprovider/gcp/instances.py, lines 191 to 195 at ad4eb0e).
Then we immediately list all instances to find out info about it (dask-cloudprovider/dask_cloudprovider/gcp/instances.py, lines 230 to 234 at ad4eb0e).
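For illustration, here is a minimal sketch of that two-step pattern using the standard Google API Python client; the function and variable names are illustrative, not the actual dask-cloudprovider code paths referenced above.

```python
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")


def create_and_describe(project, zone, config):
    # Step 1: submit the insert request. This returns a GCP operation,
    # not a fully provisioned instance.
    compute.instances().insert(project=project, zone=zone, body=config).execute()

    # Step 2: immediately list instances to recover the new VM's details
    # (e.g. its IP addresses). If GCP has not finished creating it yet,
    # the instance may be missing from this listing entirely, which is
    # the suspected race condition.
    listing = compute.instances().list(project=project, zone=zone).execute()
    return next(
        (item for item in listing.get("items", []) if item["name"] == config["name"]),
        None,  # returning None here is exactly the failure mode described
    )
```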
When creating a small number of instances this is probably fine. However, when requesting a larger number of instances, GCP may not be as quick to fulfil the requests. From a quick scan of the code I think we need to implement a retry around that lookup. The timeout utility (dask-cloudprovider/dask_cloudprovider/utils/timeout.py, lines 9 to 10 at ad4eb0e) may be useful here.
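A minimal sketch of such a retry, assuming a plain poll-until-visible loop rather than the repo's actual timeout helper (whose interface isn't shown here); `wait_for_instance` and `lookup` are hypothetical names.

```python
import asyncio


async def wait_for_instance(lookup, name, timeout=120, interval=2):
    """Poll ``lookup()`` until an instance called ``name`` appears.

    ``lookup`` stands in for the "list all instances" call above and should
    return the current list of instance dicts. Raises TimeoutError if GCP
    has not made the instance visible within ``timeout`` seconds.
    """
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        for inst in lookup():
            if inst.get("name") == name:
                return inst
        if asyncio.get_running_loop().time() >= deadline:
            raise TimeoutError(f"Instance {name} not visible after {timeout}s")
        # Give GCP a moment to finish provisioning before trying again.
        await asyncio.sleep(interval)
```

With something along these lines, the lookup would tolerate the delay between the insert request being accepted and the VM becoming listable, instead of failing outright.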
I noticed this error when running a 50-node cluster:
I only end up with 49 workers (and 49 worker VMs), so I believe this caused one of them not to launch.
This same error occurs in #218.