feat: make namespace related api more friendly for distributed engines #5547
Conversation
Force-pushed from bdff78f to 8dc1662.
Code Review

This PR introduces the …

P0 Issues: None identified.
P1 Issues: …
Observations: …
Code Review

This PR refactors how storage options and credential providers are handled by introducing a unified …

Summary: The changes unify the previous separate …
P1 Issues: …
Minor Observations: …

The architecture change looks solid overall. The main concern is the blocking Hash impl, which should be addressed before merge.
In distributed-engine scenarios, we have the following problem with vended credentials: we pass in the namespace and table ID to get the table location and to allow dynamic credential refresh. The table is then cached and used to serve multiple queries.

When executing on another worker (e.g. Spark, LanceDB Enterprise, etc.), we essentially have to fetch the credentials again, because the worker does not know which credentials are currently in use; they could already have been refreshed and differ from the initial input.
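For illustration, a minimal sketch of that status quo, where every worker re-resolves the table through the namespace. The response field names (`location`, `storage_options`) are assumptions for illustration, not the exact client API:

```python
import lance

def open_on_worker_today(ns_client, table_id):
    # Each worker pays its own describe_table round trip, even though
    # the driver already resolved the same table once.
    # `resp.location` and `resp.storage_options` are assumed field names.
    resp = ns_client.describe_table(table_id)
    return lance.dataset(resp.location, storage_options=resp.storage_options)
```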
This PR adds an API
dataset.current_storage_options()to get the latest storage options to be used, so that it can be served as the initial storage options to use in the worker node. This ensures we only make a single call tonamespace_client.describe_table. Note that the engine should configure the credentials refresh lead time to be long enough that it is sufficient to use a single credential in the work in most cases.Another problem is that when distributing to the worker, we already know the location of the table and the storage options to use, so we should just pass that in and use it. But today the API is an either-or, user either pass in uri or pass in namespace + table ID and it would reload uri and storage options. We added documentation and updated API so that if user pass in namespace + table_id, we do the automated workflow to get uri and storage options and set provider as usual, but also give caller the option to set each component manually to match various caching conditions.
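A minimal sketch of the intended driver/worker flow under this PR. Only `current_storage_options()` and `namespace_client.describe_table` are named in the description; the `namespace` and `table_id` keyword names below are assumptions, so check the updated API docs for the exact signature:

```python
import lance

def open_on_driver(ns_client, table_id):
    # Resolve the table once through the namespace: the automated
    # workflow fetches the URI, storage options, and credential provider
    # with a single describe_table call. Keyword names are assumed.
    ds = lance.dataset(namespace=ns_client, table_id=table_id)
    # New in this PR: snapshot the latest (possibly refreshed) storage
    # options so workers start from currently valid credentials.
    return ds.uri, ds.current_storage_options()

def open_on_worker(uri, storage_options):
    # The worker already knows the location and credentials, so it
    # passes them in explicitly instead of re-resolving the namespace.
    return lance.dataset(uri, storage_options=storage_options)
```

Under other caching conditions, the caller can instead set only some of these components manually and let the automated workflow fill in the rest.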