Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core feature] use of "url_to_fs" from fsspec.core in FileAccessProvider #3945

Open
2 tasks done
timheb opened this issue Aug 11, 2023 · 7 comments
Open
2 tasks done
Assignees
Labels
enhancement New feature or request flytekit FlyteKit Python related issue stale

Comments

@timheb
Copy link

timheb commented Aug 11, 2023

Motivation: Why do you think this is important?

Currently it is difficult to use a custom fsspec protocol in flytekit. You can only implement configurations for s3 or gs. With url_to_fs you could implement dynamic configurations in your own protocol which can also be used as raw_output_prefix.

Goal: What should the final outcome look like, ideally?

You can define your own fsspec protocols which are based on existing ones. Special credencial handling can be done in _get_kwargs_from_urls. Also, the protocol can be used as raw_output_prefix

example:

class MyFileSystem(AzureBlobFileSystem):
    @staticmethod
    def _get_kwargs_from_urls(path):
        # ...
        return kwargs

Describe alternatives you've considered

Do not know any alternative

Propose: Link/Inline OR Additional context

Could already be considered here: flyteorg/flytekit#1775

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@timheb timheb added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Aug 11, 2023
@welcome
Copy link

welcome bot commented Aug 11, 2023

Thank you for opening your first issue here! 🛠

@eapolinario eapolinario added flytekit FlyteKit Python related issue and removed untriaged This issues has not yet been looked at by the Maintainers labels Aug 11, 2023
@wild-endeavor
Copy link
Contributor

This is related to #3942. The PR there is attempting to add Azure specific args to the data handling portion of the pyflyte run code. There's two issues here -

  1. basically, the data handling portion of pyflyte run, is not the same as the data handling portion in the rest of flytekit. The reason this was done i believe was that at the time we could not get the headers passing correctly to the http fsspec implementation, though this is probably worth another look. Btw, I didn't know this was possible - that the requests library would be called to post to absf://.
  2. Even if this is aligned, I think the issue that these two tickets are getting at is that in the data_persistence layer the call to fsspec is tweaked by flytekit. I think we can remove the anonymous return handling on line 124/125.
  • The reason the additional args are added to flytekit code instead of registering a flyte specific s3 or gcs fsspec implementation is so that flytekit doesn't get in the way of the user if the user wants to invoke fsspec directly in user code.
  • However this leads to the question of how to let users configure how flytekit calls fsspec.

fsspec already supports default arguments to file systems though - https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration. Is this something we can use @devictr?

@timheb
Copy link
Author

timheb commented Aug 15, 2023

My idea would be to remove get_filesystem() from FileAccessProvider and convert the code to get_filesystem_for_path(). You could then call url_to_fs() there and the user could dynamically search for configurations in his fsspec protocol at runtime. For example via a secret management tool like hashicorp vault. It should then work with FlyteFile out of the box

@devictr
Copy link

devictr commented Aug 15, 2023

Btw, I didn't know this was possible - that the requests library would be called to post to absf://.

The signed_url that the requests library is doing a PUT to is an normal https:// URL. The data proxy in flyteadmin seems to be getting a signed URL from Azure and then passes it down to flytekit for the upload.

fsspec already supports default arguments to file systems though - https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration. Is this something we can use @devictr?

I think that could work. That way if we specify our preferred way of creating a filesystem using a config file or env variables, we can leave the kwargs to the user so they can override the default behaviour.

the user could dynamically search for configurations in his fsspec protocol at runtime

@timheb what do you mean by this? It seems like get_filesystem() still allows you to pass any kwargs to the underlying filesystem that gets created.

@timheb
Copy link
Author

timheb commented Aug 21, 2023

@devictr
I have built the following filesystem myfs://{azure_storage_account}/{container_name}. The file system inherits from AzureBlobFileSystem as described in my first post. Within fsspec.core.url_to_fs() the function _get_kwargs_from_urls() is called in my file system and you can dynamically look up credencials depending on the path of the file and return kwargs as required by AzureBlobFileSystem. Such a path should also work as a raw output path in flyteconsole.

@wild-endeavor
Copy link
Contributor

@timheb - I'm not quite sure I follow.

I agree that the current get_filesystem in data_persistence.py presents a bit of problem for s3/gcs users, and presents a problem for anonymous access. The reason it causes issues for s3/gcs users is because that function intercepts those calls, and creates the filesystem object with its own set of kwargs. If you want to inject your own set of kwargs to the file system creation for flytekit calls, you're out of luck. The reason flytekit does not inject the arguments as defaults into the fsspec implementations is because we don't want to interfere with the normal fsspec implementation. that is, flytekit should be able to supply its own set of kwargs when creating and s3 fs and users should be able to supply a different set of kwargs if they choose to instantiate an s3 fs. this is something we want to maintain - for example, in the flyte sandbox mode, flytekit needs to be able to talk to the minio (local s3 implementation) but users should still be able to hit the real s3 if they so chose.

But all this is kinda orthogonal to the issue at hand because you're creating your own file system right? you're trying to get the flytekit data persistence layer to trigger different permissions based on the path. But there's no guarantee that url_to_fs gets called right? even if we remove data_persistence::get_filesystem and only use data_persistence:: get_filesystem_for_path and within that invoke ``fsspec.core::url_to_fs`, that only works when we go that route right? If ever a plugin or something calls your fs directly, then the auth logic won't trigger. basically you're saying myfs's api requires that you first configure the instance correctly for every call. I don't think this is the right approach. can you instead override all the public facing apis that AzureBlobFileSystem has to do this check first and then super()?

Wrt the things we can do to improve the current data_persistence::get_filesystem, we can

  • make the anonymous handling better in the default case.
  • make it so that you can specify default kwargs by protocol for filesystem creation
  • make it so that you can specify default kwargs by protocol for filesystem actions (put/get/etc)

For background, the reason we need the anonymous handling is because s3 (and i think gcs as well) will fail to fetch a public object if you try to fetch it with credentials. weird, but that's how it is. because flytekit accesses arbitrary objects, it tries to pull anonymously if the non-anonymous way fails.

Copy link

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label May 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request flytekit FlyteKit Python related issue stale
Projects
None yet
Development

No branches or pull requests

4 participants