
Running Radius on AKS Automatic causes Node failures #7676

Closed
loekd opened this issue Jun 11, 2024 · 8 comments
Labels
bug: Something is broken or not working as expected · important: This item is a high priority issue we intend to address as soon as possible · triaged: This issue has been reviewed and triaged

Comments

@loekd

loekd commented Jun 11, 2024

Steps to reproduce

  • Deploy Radius to AKS Automatic
  • Deploy workload
  • Scale workload to many instances

Observed behavior

  • After some time, nodes from the node pool will crash

  (screenshots attached)

  • They recover after some time

Desired behavior

  • No node crashes

Workaround

  • No workaround
  • Does not happen when running workloads without Radius

rad Version

RELEASE   VERSION   BICEP     COMMIT
0.34.0    v0.34.0   0.34.0    0fd82e7eaa9388fead4ea76cb9137ba2a225a236

Operating system

Ubuntu (default node)

Additional context

No response

Would you like to support us?

  • Yes, I would like to support you

AB#12487

@loekd added the bug (Something is broken or not working as expected) label on Jun 11, 2024
@radius-triage-bot

👋 @loekd Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process please visit our triage overview

@sylvainsf added the triaged (This issue has been reviewed and triaged) label on Jun 13, 2024
@radius-triage-bot

👍 We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

@sylvainsf added the important (This item is a high priority issue we intend to address as soon as possible) label on Jun 13, 2024
@radius-triage-bot

We've prioritized work on this issue. Please subscribe to this issue for notifications, we'll provide updates as we make progress.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

@rynowak
Contributor

rynowak commented Aug 6, 2024

The likely cause of this is that the UCP APIServer Extension went down. We're using a Kubernetes extensibility point that's pretty heavyweight and can cause issues like this. The reasons we originally chose that approach no longer apply, and we should migrate away from it.

A better approach would be for us to port-forward to the control-plane instead of exposing it through the API Server. This would still require the user to have Kubernetes credentials (that's good) but would also be more flexible because users could expose the control-plane in whatever manner they like.

@brooke-hamilton
Contributor

/assign @brooke-hamilton

@rynowak
Contributor

rynowak commented Aug 6, 2024

A better approach would be for us to port-forward to the control-plane instead of exposing it through the API Server. This would still require the user to have Kubernetes credentials (that's good) but would also be more flexible because users could expose the control-plane in whatever manner they like.

@brooke-hamilton - if you're interested in learning more about this approach, this is what the ArgoCD CLI does. Argo's control-plane is running inside the cluster, and (by default) they port-forward so the CLI can talk to it. https://github.com/argoproj/argo-cd/blob/master/pkg/apiclient/apiclient.go#L206
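
For reference, here is a minimal sketch of that port-forward pattern using client-go, similar to what the ArgoCD CLI does. This is not Radius code: the namespace, pod name, and port numbers below are assumptions for illustration only.

```go
// Hypothetical sketch: forward a local port to an in-cluster control-plane pod
// instead of reaching it through an API Server extension. Uses only the
// standard client-go port-forward machinery (the same approach kubectl and
// the ArgoCD CLI use).
package main

import (
	"fmt"
	"net/http"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/portforward"
	"k8s.io/client-go/transport/spdy"
)

func main() {
	// Load the user's kubeconfig; the caller still needs Kubernetes credentials.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholder target: wherever the UCP control-plane pod actually runs.
	namespace, podName := "radius-system", "ucp-0" // assumption, not the real layout
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(namespace).Name(podName).
		SubResource("portforward")

	transport, upgrader, err := spdy.RoundTripperFor(config)
	if err != nil {
		panic(err)
	}
	dialer := spdy.NewDialer(upgrader, &http.Client{Transport: transport}, http.MethodPost, req.URL())

	stopCh := make(chan struct{})
	readyCh := make(chan struct{})
	// Forward local 8080 to port 9443 in the pod (port numbers are assumptions).
	fw, err := portforward.New(dialer, []string{"8080:9443"}, stopCh, readyCh, os.Stdout, os.Stderr)
	if err != nil {
		panic(err)
	}

	go func() {
		<-readyCh
		fmt.Println("control-plane reachable at http://localhost:8080")
	}()
	if err := fw.ForwardPorts(); err != nil { // blocks until stopCh is closed
		panic(err)
	}
}
```

The CLI would open the tunnel, talk to the control-plane over localhost, and tear it down on exit, so no extra ingress or API Server aggregation is needed.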

@brooke-hamilton
Contributor

I opened Azure/AKS#4513 as a related issue.

@brooke-hamilton
Contributor

brooke-hamilton commented Sep 19, 2024

I believe this behavior is not related to Radius specifically. I was able to reproduce node crashes by deploying a single (non-Radius) pod to an AKS Automatic cluster and scaling it up to 100 instances. @loekd please comment if there is any other context we should consider. Thank you for reporting the issue! 🚀
