[GCP/Spot] Skypilot GCP user credentials expire on controller with SSO #2738

Open
kuza55 opened this issue Oct 26, 2023 · 9 comments
Labels
bug Something isn't working Stale

Comments

@kuza55

kuza55 commented Oct 26, 2023

Hi,

I have been trying to use skypilot's spot scheduling.

I have run into issues where the controller becomes unresponsive and the ~/.sky/skylet.log file is filled with this error:

google.auth.exceptions.RefreshError: Reauthentication is needed. Please run `gcloud auth application-default login` to reauthenticate.

My understanding of the issue is that user credentials and access tokens derived from them expire in relatively short time windows, though I am poking at my setup to see if I am doing something weird: https://stackoverflow.com/questions/69229759/longer-lasting-user-credentials-with-gcloud-auth-prevent-expiration

I have seen suggestions to create service accounts and interact with gcp using a service account, which will likely address this issue, but that feels at least like a documentation bug.

Alternatively using the service account that skypilot creates seems like it would make sense to me.
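One way to see which kind of credential a machine would pick up is to read the `type` field of the credentials JSON file. The sketch below assumes the standard gcloud file location on Linux/macOS and does not cover the metadata-server fallback on GCE VMs. The function names are illustrative and not part of SkyPilot.

```python
import json
import os
from pathlib import Path

# Default location gcloud uses for Application Default Credentials on Linux/macOS.
DEFAULT_ADC_PATH = (
    Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
)


def credential_kind(path):
    """Return the "type" field of a Google credentials JSON file.

    "authorized_user" means user credentials (subject to an organization's
    reauthentication policy); "service_account" means a service account key.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f).get("type", "unknown")


def active_adc_path(environ=os.environ):
    """Return the file ADC would read: the env var override if set, else the default."""
    override = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    return Path(override).expanduser() if override else DEFAULT_ADC_PATH
```

Calling `credential_kind(active_adc_path())` on the controller (when the file exists) should show whether it is running on a user login or on a service account key.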

  • Alex
@ethansiegl

I've also run into this issue but could not resolve it.

@kuza55
Author

kuza55 commented Oct 31, 2023

I think only service accounts are supported and the docs try to say this but are a little unclear: https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-auth.html#gcp

@zxexz

zxexz commented Oct 31, 2023

I run into this issue too. I got around it by forcing a service account using a bash wrapper script, but that's very brittle: when accidentally misconfigured, it falls back to user credentials, and when they expire, all the bucket mounts die.

@concretevitamin
Member

We should definitely fix this. The issue is related to the organization's reauthentication policy set up by cloud admins: https://support.google.com/cloudidentity/answer/9368756?hl=en# (Our dev accounts likely don't have this set, so we never ran into this problem.)

To solve this, we should probably make the spot controller use a long-lived service account so it doesn't need to reauth. I just tried the following on a new Google Cloud email/project, and verified that launching a VM works:

  1. Follow https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/gcp.html#gcp-minimal-permissions to first create an IAM role (minimal-skypilot-role) and then a service account using that role (skypilot-v1).
  2. Follow https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-auth.html#gcp-service-account. In particular:
       • Go to IAM & Admin -> Service Accounts -> click on the service account created above -> KEYS tab -> ADD KEY. This will download a json key to local.
       • Run the rest of the code snippet shown in the above doc (export GOOGLE_APPLICATION_CREDENTIALS ...)
  3. Run sky check to verify GCP is enabled under this new identity (service account).
  4. Run sky launch / other Sky commands.

Let us know if the above works for you? @ethansiegl @kuza55 @zxexz
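For anyone scripting the steps above, here is a minimal sketch of the environment-variable part only, with a check that fails loudly if the file is not a service account key. The function name is hypothetical, and the linked doc's snippet may include further steps that this does not replace.

```python
import json
import os


def service_account_env(key_path, base_env=None):
    """Return a copy of the environment that pins ADC to a service account key.

    Raises ValueError if the file is not a service account key, so a
    misconfiguration is caught up front rather than silently using
    some other credential.
    """
    key_path = os.path.abspath(os.path.expanduser(key_path))
    with open(key_path, encoding="utf-8") as f:
        kind = json.load(f).get("type")
    if kind != "service_account":
        raise ValueError(f"{key_path} has type {kind!r}, expected 'service_account'")
    env = dict(os.environ if base_env is None else base_env)
    env["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return env


# Example usage (the key path is a placeholder for wherever the JSON key was saved):
#   import subprocess
#   env = service_account_env("/path/to/service-account-key.json")
#   subprocess.run(["sky", "check"], env=env, check=True)
```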

@concretevitamin
Member

concretevitamin commented Oct 31, 2023

Confirmed that with the above,

sky spot launch --cloud gcp --cpus 2+ sleep 1000

started both a new controller (given that no controller exists in sky status) and the spot cluster under the service account.

To check that, you could click on either the controller VM or the spot cluster VM -> OBSERVABILITY -> LOGS, and check that the logs display the service account (see bottom right):

[Image: Screen Shot 2023-10-31 at 12 49 09]

@zxexz

zxexz commented Nov 1, 2023

> We should definitely fix this. The issue is related to the organization's reauthentication policy set up by cloud admins: https://support.google.com/cloudidentity/answer/9368756?hl=en# (Our dev accounts likely don't have this set, so we never ran into this problem.)
>
> To solve this, we should probably make the spot controller use a long-lived service account so it doesn't need to reauth. I just tried the following on a new Google Cloud email/project, and verified that launching a VM works:
>
>   1. Follow https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/gcp.html#gcp-minimal-permissions to first create an IAM role (minimal-skypilot-role) and then a service account using that role (skypilot-v1).
>   2. Follow https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-auth.html#gcp-service-account. In particular:
>        • Go to IAM & Admin -> Service Accounts -> click on the service account created above -> KEYS tab -> ADD KEY. This will download a json key to local.
>        • Run the rest of the code snippet shown in the above doc (export GOOGLE_APPLICATION_CREDENTIALS ...)
>   3. Run sky check to verify GCP is enabled under this new identity (service account).
>   4. Run sky launch / other Sky commands.
>
> Let us know if the above works for you? @ethansiegl @kuza55 @zxexz

Yep. I had actually already followed those docs, and it does work most of the time. It's hard to track down exactly why it sometimes fails, but whenever I ssh into the instance and run gcloud auth list, both the service account and my user account are listed. This seems related to another issue I'm having, where the cluster doesn't die, but after the SSO session timeout, any mounted GCS buckets (from the yaml config) stop working due to permissions issues. It seems that even when the cluster and the ray process are both started using the service account, the gcsfuse mount is occasionally still done with SSO user creds.
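A small sketch for spotting leftover user accounts on an instance. It assumes the input is the output of `gcloud auth list --format=json`, taken here to be a JSON list of objects with an `account` field, and that service account emails end in `gserviceaccount.com`. The helper name is made up for illustration.

```python
import json

# Assumption: service account emails end with this domain suffix.
SERVICE_ACCOUNT_SUFFIX = "gserviceaccount.com"


def user_accounts(auth_list_json):
    """Return credentialed accounts that do not look like service accounts."""
    accounts = [entry.get("account", "") for entry in json.loads(auth_list_json)]
    return [
        account
        for account in accounts
        if account and not account.endswith(SERVICE_ACCOUNT_SUFFIX)
    ]
```

A non-empty result on the controller or a cluster node would suggest that user credentials are still available for some process to fall back to.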

I have a "workaround" for now: in our launch wrapper script, I make sure to run gcloud auth revoke --all before activating the service account. It's annoying, but it has kept the issue from reappearing so far 🤞.
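The revoke-then-activate order could be sketched like this. It only builds the two gcloud commands; the function name and key path are placeholders, and the actual wrapper script is not shown in this thread.

```python
def reset_and_activate_commands(key_file):
    """Build the gcloud commands for the revoke-then-activate workaround.

    Returned as argument lists so they can be passed to subprocess.run
    without going through a shell.
    """
    return [
        # Drop every credentialed account gcloud knows about, user logins included.
        ["gcloud", "auth", "revoke", "--all"],
        # Then make the service account the only (and active) account.
        ["gcloud", "auth", "activate-service-account", f"--key-file={key_file}"],
    ]


# Example usage (placeholder path):
#   import subprocess
#   for cmd in reset_and_activate_commands("/path/to/service-account-key.json"):
#       subprocess.run(cmd)
```

The revoke step may exit non-zero when there is nothing to revoke, so a wrapper would probably want to tolerate a failure there while still treating a failed activation as fatal.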

@Michaelvll Michaelvll added the bug Something isn't working label Nov 16, 2023
@Michaelvll Michaelvll changed the title from "Skypilot GCP user credentials expire on controller" to "[GCP/Spot] Skypilot GCP user credentials expire on controller with SSO" Nov 16, 2023

This comment was marked as outdated.


This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.


@github-actions github-actions bot added the Stale label Nov 13, 2024
5 participants