Description
Describe the bug
Installing an AKS extension on AKS Automatic, or installing a Kubernetes add-in whose pods do not tolerate the CriticalAddonsOnly taint, takes about an hour, and the installers return timeout errors. Examples are Dapr and Radius.
To Reproduce
Steps to reproduce the behavior:
This example installs Dapr using the az k8s-extension create command. The Radius command rad install kubernetes results in similar behavior.
resource_group="<resource group name>"
# Generate a unique cluster name based on the resource group name
cluster_name=$(mktemp -u "$resource_group-XXXX")
echo "Creating cluster $cluster_name"
# Create an AKS Automatic cluster and fetch its credentials
az aks create --resource-group "$resource_group" --name "$cluster_name" --sku automatic --generate-ssh-keys -l southcentralus
az aks get-credentials --resource-group "$resource_group" --name "$cluster_name"
# Install the Dapr extension
az k8s-extension create --cluster-type managedClusters \
  --cluster-name "$cluster_name" \
  --resource-group "$resource_group" \
  --name dapr \
  --extension-type Microsoft.Dapr \
  --auto-upgrade-minor-version false
Expected behavior
The extension installs as expected.
Screenshots
Results of the az k8s-extension create command:
(ExtensionOperationFailed) The extension operation failed with the following error: Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state Last resource not ready was dapr-system/dapr-monitoring-metrics For more events check kubernetes events using kubectl events -n dapr-system : Recommendation Please contact Microsoft support for further inquiries : InnerError [release dapr failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: For additional troubleshooting information, please see https://aka.ms/dapr-aks.
Code: ExtensionOperationFailed
Message: The extension operation failed with the following error: Error: [ InnerError: [Helm installation failed : Timed out waiting for the resource to come to a ready/completed state Last resource not ready was dapr-system/dapr-monitoring-metrics For more events check kubernetes events using kubectl events -n dapr-system : Recommendation Please contact Microsoft support for further inquiries : InnerError [release dapr failed, and has been uninstalled due to atomic being set: timed out waiting for the condition]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: For additional troubleshooting information, please see https://aka.ms/dapr-aks.
While the installer is running, this command returns the results below. You can see that the Dapr namespace took about 30 minutes to deploy and that none of the Dapr pods have started. At this point the Dapr installer was still running.
$ kubectl get nodes && kubectl get namespaces && kubectl get pods -n dapr-system
NAME                                STATUS   ROLES    AGE   VERSION
aks-nodepool1-91565423-vmss000000   Ready    <none>   37m   v1.29.7
aks-nodepool1-91565423-vmss000001   Ready    <none>   37m   v1.29.7
aks-nodepool1-91565423-vmss000002   Ready    <none>   37m   v1.29.7
NAME                 STATUS   AGE
app-routing-system   Active   21m
dapr-system          Active   7m9s
default              Active   38m
gatekeeper-system    Active   37m
kube-node-lease      Active   38m
kube-public          Active   38m
kube-system          Active   38m
NAME                                       READY   STATUS    RESTARTS   AGE
dapr-monitoring-metrics-77bbd48f79-j2dz6   0/4     Pending   0          6m25s
dapr-operator-7c77dccfcb-7t5vr             0/1     Pending   0          6m25s
dapr-operator-7c77dccfcb-vr754             0/1     Pending   0          6m25s
dapr-operator-7c77dccfcb-zvg26             0/1     Pending   0          6m25s
dapr-placement-server-0                    0/1     Pending   0          6m25s
dapr-placement-server-1                    0/1     Pending   0          6m25s
dapr-placement-server-2                    0/1     Pending   0          6m25s
dapr-sentry-74d8d794fd-49jkv               0/1     Pending   0          6m25s
dapr-sentry-74d8d794fd-wk69h               0/1     Pending   0          6m25s
dapr-sentry-74d8d794fd-zdfvf               0/1     Pending   0          6m25s
dapr-sidecar-injector-74cb4f87fd-9fk42     0/1     Pending   0          6m25s
dapr-sidecar-injector-74cb4f87fd-c6fns     0/1     Pending   0          6m25s
dapr-sidecar-injector-74cb4f87fd-jddhb     0/1     Pending   0          6m25s
Examining the pods that fail to schedule, the event below is shown in the Azure Portal:
involvedObject:
  kind: Pod
  namespace: dapr-system
  name: dapr-operator-7c77dccfcb-7t5vr
  uid: 11632e33-5066-4abc-97ec-c6c74945a78c
  apiVersion: v1
  resourceVersion: '12726'
reason: FailedScheduling
message: >-
  0/3 nodes are available: 3 node(s) had untolerated taint {CriticalAddonsOnly:
  true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for
  scheduling.
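The same scheduling events can likely be pulled from the command line instead of the Azure Portal. This is a minimal sketch, assuming kubectl is pointed at the affected cluster; the function name show_scheduling_failures is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: list FailedScheduling events in a namespace,
# oldest first. Defaults to the dapr-system namespace used in this report.
show_scheduling_failures() {
  local namespace="${1:-dapr-system}"
  kubectl get events --namespace "$namespace" \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp
}
```

Calling show_scheduling_failures with no arguments queries dapr-system; pass another namespace name to inspect a different add-in.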
Eventually, AKS Automatic creates an additional node without the CriticalAddonsOnly taint, and the pods are able to start.
$ kubectl get nodes && kubectl get namespaces && kubectl get pods -n dapr-system
NAME                                STATUS   ROLES    AGE    VERSION
aks-default-nkpzb                   Ready    <none>   116s   v1.29.7
aks-nodepool1-91565423-vmss000000   Ready    <none>   50m    v1.29.7
aks-nodepool1-91565423-vmss000001   Ready    <none>   50m    v1.29.7
aks-nodepool1-91565423-vmss000002   Ready    <none>   50m    v1.29.7
NAME                 STATUS   AGE
app-routing-system   Active   35m
dapr-system          Active   20m
default              Active   52m
gatekeeper-system    Active   51m
kube-node-lease      Active   52m
kube-public          Active   52m
kube-system          Active   52m
NAME                                       READY   STATUS    RESTARTS   AGE
dapr-monitoring-h5tvr                      3/3     Running   0          98s
dapr-monitoring-metrics-77bbd48f79-66srf   4/4     Running   0          2m51s
dapr-operator-7c77dccfcb-kj4tm             1/1     Running   0          2m51s
dapr-operator-7c77dccfcb-nppcj             1/1     Running   0          2m51s
dapr-operator-7c77dccfcb-wztws             1/1     Running   0          2m51s
dapr-placement-server-0                    1/1     Running   0          2m51s
dapr-placement-server-1                    1/1     Running   0          2m51s
dapr-placement-server-2                    1/1     Running   0          2m51s
dapr-sentry-74d8d794fd-k6mdx               1/1     Running   0          2m51s
dapr-sentry-74d8d794fd-ldbl9               1/1     Running   0          2m51s
dapr-sentry-74d8d794fd-rmb6j               1/1     Running   0          2m51s
dapr-sidecar-injector-74cb4f87fd-cqr26     1/1     Running   0          2m51s
dapr-sidecar-injector-74cb4f87fd-ldfjm     1/1     Running   0          2m51s
dapr-sidecar-injector-74cb4f87fd-p7wvn     1/1     Running   0          2m51s
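To confirm which nodes carry the taint, something like the following may help. It is only a sketch, assuming kubectl access to the cluster; list_node_taints is a hypothetical helper name. Based on the events above, the three aks-nodepool1 nodes would be expected to list CriticalAddonsOnly, while the newly added aks-default node should not.

```shell
# Hypothetical helper: print each node name next to the keys of its taints.
# Nodes with no taints show <none> in the TAINTS column.
list_node_taints() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```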
Environment (please complete the following information):
- CLI Version [e.g. 3.22]
az version
{
  "azure-cli": "2.63.0",
  "azure-cli-core": "2.63.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {
    "aks-preview": "7.0.0b8",
    "k8s-extension": "1.6.1"
  }
}
- Kubernetes version: 1.29.7
- SKU: AKS Automatic
- CLI Extension version: See above
- Browser: Edge
Additional context
It looks like add-ins such as Dapr and Radius cannot tolerate the CriticalAddonsOnly taint. Eventually AKS Automatic adds a new node that does not have the taint, and the Dapr pods are able to start.
Adding nodes to AKS Automatic results in new nodes that have the taint applied.
Adding a node without the taint fails with an error.
$ az aks nodepool update --resource-group brhamilt-aks --cluster-name brhamilt-aks-Caoh --name nodepool1 --node-taints ""
The behavior of this command has been altered by the following extension: aks-preview
(BadRequest) Managed cluster 'Automatic' SKU should set taint 'CriticalAddonsOnly=true:NoSchedule' for 'System' node pool
Code: BadRequest
Message: Managed cluster 'Automatic' SKU should set taint 'CriticalAddonsOnly=true:NoSchedule' for 'System' node pool
Removing the taint results in an error.