Description
Lately, we have been seeing continuous failures when rolling out new MachineDeployments (MDs) in Azure environments.
The error is always the machine-controller-webhook timing out. It occurs in KubeOne as well as in KKP user clusters.
Either some Azure API (mostly the one related to VM sizes) has become very slow, or we need better filters in our API call.
Here are the logs from a MachineDeployment in a KKP user cluster:
failed to create machine deployment: Internal error occurred: failed calling webhook "machine-controller.kubermatic.io-machinedeployments": failed to call webhook: Post "https://machine-controller-webhook.cluster-XXXXX.svc.cluster.local./machinedeployments?timeout=10s": context deadline exceeded
{
  "error": {
    "code": 500,
    "message": "failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'"
  }
}
I have seen that increasing the webhook timeout to 30s improves the situation a bit.
But since a webhook timeout is capped at 30s, we should consider caching the list of VM sizes (SKUs) to speed things up.