
[BUG] Azure Workload Identity not working with fsspec in flytekit #3962

Closed
2 tasks done
devictr opened this issue Aug 15, 2023 · 24 comments
Labels
bug Something isn't working

Comments

@devictr

devictr commented Aug 15, 2023

Describe the bug

For workload identity to work within AzureBlobFileSystem, the filesystem needs to be instantiated with an account name and anon set to False. Currently, due to how get_filesystem() is set up, we cannot pass the right kwargs for Workload Identity to work.

The workaround is to set AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY, which will allow flytekit to talk to blob storage, but it would be nice to support Workload Identity directly.
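As a sketch of what the working instantiation looks like (the helper below is illustrative, not flytekit code; reading AZURE_STORAGE_ACCOUNT_NAME is just one way to obtain the account name):

```python
import os

def azure_storage_options() -> dict:
    # For Workload Identity, adlfs's AzureBlobFileSystem needs the account
    # name and anon=False so it falls back to credential-based auth
    # (DefaultAzureCredential) instead of anonymous access.
    return {
        "account_name": os.environ["AZURE_STORAGE_ACCOUNT_NAME"],
        "anon": False,
    }

# With adlfs installed, the filesystem would then be created as:
#   from adlfs import AzureBlobFileSystem
#   fs = AzureBlobFileSystem(**azure_storage_options())
```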

This relates to the discussion in #3945

Expected behavior

Enabling Azure Workload Identity should work out of the box when flytekit tries to read/write to blob storage.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@devictr devictr added bug Something isn't working untriaged This issue has not yet been looked at by the Maintainers labels Aug 15, 2023
@devictr
Author

devictr commented Aug 15, 2023

@wild-endeavor Setting fsspec environment variables didn't work.
I tried setting these in my PodTemplate:

FSSPEC_abfs_account_name : <myaccountname>
FSSPEC_abfs_anon : False

And I'm getting the following error:

[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[a6qhv7lvbn2tjflkw5t7-n0-0] terminated with exit code (1). Reason [Error]. Message: 
                                                │
│                                                                              │
│ ❱ 506 │   │   _download_distribution(additional_distribution, dest_dir)      │
│                                                                              │
│ /usr/local/lib/python3.8/site-packages/flytekit/core/utils.py:295 in wrapper │
│                                                                              │
│ ❱ 295 │   │   │   │   return func(*args, **kwargs)                           │
│                                                                              │
│ /usr/local/lib/python3.8/site-packages/flytekit/tools/fast_registration.py:1 │
│ 13 in download_distribution                                                  │
│                                                                              │
│ ❱ 113 │   FlyteContextManager.current_context().file_access.get_data(additio │
│                                                                              │
│ /usr/local/lib/python3.8/site-packages/flytekit/core/data_persistence.py:301 │
│ in get_data                                                                  │
│                                                                              │
│ ❱ 301 │   │   │   raise FlyteAssertion(                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
FlyteAssertion: Failed to get data from 
abfs://flyte/flytesnacks/development/X2KXFBHGJCTXF3U767T7TDKJ4M======/script_mod
e.tar.gz to /root/ (recursive=False).

Original exception: Operation returned an invalid status 'Server failed to 
authenticate the request. Please refer to the information in the 
www-authenticate header.'
ErrorCode:NoAuthenticationInformation

I could try creating and mounting a config file as well, but it's a bit more time-consuming and I need to move forward with my deployment.
In any case, I think that adding a similar special case for Azure in get_filesystem() would be nice. It's not ideal in terms of code, but it would help make Azure deployments smoother.

@devictr
Author

devictr commented Aug 16, 2023

It could be that I was missing this on my cluster: https://github.com/Azure/azure-workload-identity/tree/main/charts/workload-identity-webhook

I'll try again once it's installed

@devictr
Author

devictr commented Aug 16, 2023

Nevermind! It turns out this helm chart is not necessary on managed clusters. I was also able to access blob storage from a task by instantiating the filesystem with the parameters I described above. So this bug is still valid.

@eapolinario eapolinario removed the untriaged This issue has not yet been looked at by the Maintainers label Aug 18, 2023
@wild-endeavor
Contributor

left a comment on the other issue as well.

Okay, so the ask is to make the flytekit invocation of fsspec implementations able to have default options for setting up the instance, and potentially a different set of args for the put/get calls?

@devictr
Author

devictr commented Aug 23, 2023

the ask is to make the flytekit invocation of fsspec implementations be able to have default options for setting up the instance

Yes, similarly to the S3 and GCS code paths in get_filesystem(), we need to pass specific kwargs (account_name and anon) to the fsspec filesystem constructor for Azure auth to work.
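A minimal sketch of what such an Azure branch could look like, mirroring the existing S3/GCS pattern (function name and env var lookup are illustrative, not flytekit's actual code):

```python
import os

def filesystem_kwargs(protocol: str) -> dict:
    # Hypothetical special case mirroring the S3/GCS branches in
    # get_filesystem(): for abfs, force anon=False so adlfs attempts
    # credential-based auth (which Workload Identity relies on), and
    # pass the account name when one is configured.
    kwargs = {}
    if protocol == "abfs":
        kwargs["anon"] = False
        account = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
        if account:
            kwargs["account_name"] = account
    return kwargs
```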

@devictr
Author

devictr commented Aug 23, 2023

@wild-endeavor I can make a PR to add that to get_filesystem() if you'd like. I know you're planning to refactor, but this is a two-line change that could help inform your refactor as well.

@wild-endeavor
Contributor

yes please, no rush, whenever you get a chance.

@fiedlerNr9

Just stumbled across this and would love to see it!

@Tom-Newton
Contributor

Tom-Newton commented Sep 6, 2023

I think I found another place that needs to be updated: encoding pandas dataframes as parquet or other file formats, e.g. https://github.com/flyteorg/flytekit/blob/5c23325ac7317bb6a3e81aa22e654ace8eec1e1b/flytekit/types/structured/basic_dfs.py#L87-L109. This code path uses fsspec's url_to_fs function https://github.com/fsspec/filesystem_spec/blob/4eeba2ba2da6ec39ac98fa7e02af53b208446106/fsspec/core.py#L343

It seems to work fine, apart from needing to set anon=False on Azure (and probably on other filesystems too). Currently it only does that for S3: https://github.com/flyteorg/flytekit/blob/5c23325ac7317bb6a3e81aa22e654ace8eec1e1b/flytekit/types/structured/basic_dfs.py#L30-L36.

At least in this case, setting account_name is not required so long as you use a path format that fsspec understands; url_to_fs extracts the account name automatically. Paths like abfs://<container-name>@<account-name>.dfs.core.windows.net/<path within container> work, but abfs://<account-name>/<container-name>/<path within container> does not.
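To illustrate how the account name can be recovered from the first path format, here is a stdlib-only sketch of the extraction (a re-implementation for illustration, not the actual adlfs/fsspec code):

```python
from urllib.parse import urlsplit

def account_from_abfs_url(url: str):
    # In abfs://<container>@<account>.dfs.core.windows.net/<path>, urlsplit
    # parses <container> as the netloc's "username" and the account endpoint
    # as the hostname, so the account name is the first host label.
    parts = urlsplit(url)
    if parts.username and parts.hostname:
        return parts.hostname.split(".")[0]
    # abfs://<name>/<path> gives no way to distinguish account from container.
    return None

print(account_from_abfs_url(
    "abfs://flyte@myaccount.dfs.core.windows.net/dev/script.tar.gz"))
# -> myaccount
```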

@fiedlerNr9

Nice catch! One thing I did not get is where fsspec retrieves the value for account_name if you set the path like abfs://<container-name>@<account-name>.dfs.core.windows.net/<path within container>? Can you show me where this is happening?

Other than that, I think it would be best to add an azure_setup_args in this function and get the account_name using this adlfs function.
What do you think?

@Tom-Newton
Contributor

Tom-Newton commented Sep 19, 2023

Nice catch! One thing I did not get is where fsspec retrieves the value for account_name if you set the path like abfs://<container-name>@<account-name>.dfs.core.windows.net/<path within container>? Can you show me where this is happening?

I haven't re-run the code to confirm, but I think url_to_fs calls _un_chain, which in turn calls _get_kwargs_from_urls.

Other than that, i think it would be best to add a azure_setup_args in this function

That is what I ended up doing with https://github.com/Tom-Newton/flytekit/commits/tomnewton/add_azure_data_config. With this I actually had to go back to using paths relative to the storage account (abfs://<container-name>/<path within container>) to avoid an error about duplicate keyword arguments for account_name, and use environment variables to set the account name.

and get the account_name using this adlfs function. What do you think?

I tried modifying flytekit to use full paths and url_to_fs to create the filesystem objects instead of just relying on the protocol: https://github.com/Tom-Newton/flytekit/commits/tomnewton/use_paths_for_initialising_fsspec
However, I gave up on this solution because the outputPrefix that gets passed into the task when it is launched is of the format abfs://<container-name>/<path within container>. So I ended up configuring the account_name and the other important secrets with special flyte-specific environment variables: https://github.com/Tom-Newton/flytekit/commits/tomnewton/add_azure_data_config

In summary, I think something like https://github.com/Tom-Newton/flytekit/commits/tomnewton/add_azure_data_config is the best solution for a few reasons:

  1. It's the only one I could get to work with modifications to flytekit alone. I think going all in on using full paths would require changes to flyte propeller.
  2. The change just implements a pattern that already exists for S3 and GCS.
  3. Permissions can be configured globally with a default pod template, and it doesn't impact the use of any other Azure authentication you might want inside the task.

For reference there was some slack discussion: https://flyte-org.slack.com/archives/C01P3B761A6/p1694537671789039?thread_ts=1693319124.713779&cid=C01P3B761A6

@wild-endeavor
Contributor

@Tom-Newton @fiedlerNr9 - yeah, let's just go ahead with modifying the current constructs now to support Azure better, unless you can think of a more seamless way to manage the necessary configuration across the cloud platforms. Could one or both of you open up a PR?

One thing I'm pretty sure we still shouldn't do is muck around with the fsspec configs directly - that is, flytekit should not get in the way of raw fsspec (as this can change behavior inexplicably for the user).

thank you all

@Tom-Newton
Contributor

Tom-Newton commented Sep 19, 2023

I opened a draft PR flyteorg/flytekit#1842. This is based on the changes that @devictr and I are currently using. I've left it as a draft because I think we should add some unit tests, and I was going to consider renaming a couple of things and adding support for yaml config. I'm not sure how much time I will have to work on this right now, so if anyone wants to take it and push it through PR review, go for it.

@fiedlerNr9

fiedlerNr9 commented Sep 19, 2023

One last comment:
Does this mean that by setting these special flyte-specific env variables it's possible to write/read to the storage account configured for stow AND another storage account defined by the special flyte-specific env variables?

Also want to raise awareness of this PR for stow: flyteorg/stow#9

@Tom-Newton
Contributor

Tom-Newton commented Sep 19, 2023

One last comment: Does this mean that by setting these special flyte-specific env variables it's possible to write/read to the storage account configured for stow AND another storage account defined by the special flyte-specific env variables?

Maybe yes, but I don't really understand that use case. As I understand it, we need the golang code using stow to be able to read and write to the same place as the python code using fsspec.

The way I see it, the special flyte-specific env variables should be used so the python code can connect to the same storage account configured for stow. If you want your python task to authenticate to other Azure services or storage accounts, you can do that completely as normal, because we used non-standard environment variables to avoid conflicts.

Also want to raise awareness of this PR for stow: flyteorg/stow#9

Cool. It would certainly be nicer to use Azure AD based auth rather than account key.

@fiedlerNr9

The use case would be to have one storage account for flyte metadata and multiple storage accounts for different flyte projects and domains where user data can be uploaded/downloaded via FlyteFiles or FlyteDirectories. But like you said, I think it is not possible at the moment because of the relative output format.

If you want your python flytekit code to write to the same storage account where stow is writing metadata, why not just set the env variable AZURE_STORAGE_ACCOUNT_NAME, which is already read by adlfs? I am using workload identities for Azure, and with this env variable set, it works as expected. This way we don't need the special env variables?

@tkent

tkent commented Sep 21, 2023

The Use case would be to have one storage account for flyte metadata and multiple storage accounts for different flyte projects and domains where user data can be uploaded/downloaded via Flytefiles or Flytedirectories.

FWIW, we also have this use case, but it's slightly different. We have several pre-existing storage accounts we need to connect to within tasks and would like to use FlyteFile and FlyteDirectory to interact with them.

@kumare3
Contributor

kumare3 commented Sep 21, 2023

also cc @gvashishtha and folks from blackshark who have been deploying to Azure.

@Tom-Newton
Contributor

Tom-Newton commented Sep 21, 2023

The use case would be to have one storage account for flyte metadata and multiple storage accounts for different flyte projects and domains where user data can be uploaded/downloaded via FlyteFiles or FlyteDirectories.

FWIW, we also have this use case, but it's slightly different. We have several pre-existing storage accounts we need to connect to within tasks and would like to use FlyteFile and FlyteDirectory to interact with them.

I would suggest using fsspec directly instead of FlyteFile and FlyteDirectory. That way we don't need to pass more parameters through flyte, and you can do everything the same as if you weren't using flyte. Failing that, maybe FlyteFile and FlyteDirectory could be modified to accept fsspec storage options, so you can configure it on a per-call basis instead of trying to set it with some global config.

If you want your python flytekit code to write to the same storage account where stow is writing metadata, why not just set the env variable AZURE_STORAGE_ACCOUNT_NAME, which is already read by adlfs? I am using workload identities for Azure, and with this env variable set, it works as expected. This way we don't need the special env variables?

I want flytekit to always write to that same storage account, but my python task should be free to do anything using any Azure authentication and any blob storage account. AZURE_STORAGE_ACCOUNT_NAME is a standard environment variable that other libraries will look for; when set, it will force those libraries to read/write to the flyte storage account. It's worth noting, though, that my proposed changes won't stop AZURE_STORAGE_ACCOUNT_NAME from working as it does today.

The problem with using workload identities is that they unavoidably use standard environment variables, so they will always interfere with whatever other auth the task might use.
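The conflict-avoidance idea could be sketched as follows; the FLYTE_AZURE_ prefix follows the naming in the linked branch, but treat the exact variable names as an assumption:

```python
import os

def flytekit_account_name():
    # Prefer a flyte-specific variable so flytekit's own storage auth never
    # forces other Azure libraries in the task onto the same account; fall
    # back to the standard adlfs variable for backwards compatibility.
    return (
        os.environ.get("FLYTE_AZURE_STORAGE_ACCOUNT_NAME")
        or os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
    )
```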

@fiedlerNr9

I get your point, but it would be a pity if we cannot have FlyteFile or FlyteDirectory on Azure; it's such an amazing feature.
The best solution would be to find out how to get the absolute path out of propeller, get rid of all storage_account env variables, parse the storage account out of the absolute path in flytekit, and pass it down to fsspec. This way any storage account should be accessible.

I also get your second point, and I am interested in what other libraries you use. I have AZURE_STORAGE_ACCOUNT_NAME set and workload identities activated for my tasks, and I can interact with every storage account using

from azure.storage.blob import BlobServiceClient
from azure.identity import DefaultAzureCredential

# e.g. (storage_account, blob, and local_path are placeholders)
blob_service_client = BlobServiceClient(
    f"https://{storage_account}.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
blob_client = blob_service_client.get_blob_client(container="demo", blob=blob)
with open(file=local_path, mode="rb") as upload_file:
    blob_client.upload_blob(upload_file, overwrite=True)

@Tom-Newton
Contributor

Tom-Newton commented Sep 26, 2023

I get your point, but it would be a pity if we cannot have FlyteFile or FlyteDirectory on Azure; it's such an amazing feature. The best solution would be to find out how to get the absolute path out of propeller, get rid of all storage_account env variables, parse the storage account out of the absolute path in flytekit, and pass it down to fsspec. This way any storage account should be accessible.

I agree it would be nice if everything supported absolute paths. I wouldn't want to get rid of all environment variables though, because it's still very valuable to be able to configure auth for flytekit without interfering with other Azure auth. For now, I think my proposed changes are the best option. Using the new env vars is optional, so it should be a backwards-compatible change.

For writing to multiple different storage accounts with FlyteFile and FlyteDirectory, I think we should start a new issue (this ticket was about workload identity, so I've probably derailed it enough 😅). I think flyte currently has no support for this on S3 or GCS either.

I am interested in what other libraries you use

For me, AZURE_STORAGE_ACCOUNT_NAME broke https://delta-io.github.io/delta-rs/python/. This uses https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html under the hood. I expect there are other libraries with similar issues.

@fiedlerNr9

Thanks for getting back, and nice PR so far 🚀 I agree with this being the best option at the moment. Tomorrow I will test your branch on a live cluster and try a couple of things.

And yes, let's close this issue after your PR is merged ;)

@Tom-Newton
Contributor

Between flyteorg/flytekit#1813 and flyteorg/flytekit#1842, I think it's fair to close this as complete.

@cgrass
Contributor

cgrass commented Nov 7, 2023

I created a small PR to extend the flyte-binary helm chart for azure storage support. I tested 1813 and 1842 to confirm the behavior.


9 participants