-
Notifications
You must be signed in to change notification settings - Fork 45
Add ObjectStoreRegistry
#348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ObjectStoreRegistry
#348
Conversation
Couple of open questions:
|
Looks like we would need to drop dashmap to 5.5.2 (https://github.com/xacrimon/dashmap/blob/v5.5.2/Cargo.toml) to maintain the MSRV you guys have set (1.64). 6.1.0 requires 1.6.5 rust. |
Rust 1.65 is 2.5 years old; I don't have a problem with setting that as MSRV (though I don't have final say) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we gate this by a feature flag given the additional dependency.
I think we also probably want to see how DF might switch to using this formulation, perhaps file a DF issue requesting help if you don't have capacity? I think we should make sure what we're adding here can be used downstream.
Edit: tbh regarding dashmap I'd be inclined to just use RWLock around a HashMap, this is not a contended piece of code.
Golly this code looks familiar! I don't know if this was inspired by what we do in delta-rs or if this is convergent evolution 🦀 ! Either way, I think this is a great improvement and the shedding of the dashmap dependency shouldn't be necessary based on the code paths I have seen using our implementation of the same concept. I think this would be a positive change and allow me to shed some code as well 😄 |
I've replaced |
@rtyler Does this PR look like it would meet your needs (i.e. that you could remove your own registry)? |
I opened a DataFusion github issue to ask if they'll adopt this version: apache/datafusion#15983 |
@criccomini I think the approach here could, but as written thus far it only could replace use of |
@rtyler Gotcha. I was a bit hesitant to add It's easy enough to add, though. And since |
@rtyler I've pushed an update that contains a LMK if this meets your needs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was misunderstanding how the client usage here might look, the registry contains initialized and configured ObjectStoreRef
s based on the provided Url
. I was crossing the streams in my brain and thinking about the Url scheme too much.
Upon further consideration this would be a bit more limiting than I initially considered. As such it would not replace the factories within delta-rs since there we are needing to register a factory/closure/whatever-you-wanna-call-it which will produce those ObjectStoreRef
objects rather than caching initialized refs in some global state.
I'm pondering this a bit more but the visage of an API isn't coming to me yet 🤔
src/registry.rs
Outdated
|
||
/// Get the key of a url for object store registration. | ||
/// The credential info will be removed | ||
fn get_url_key(url: &Url) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend making this a public URL so that downstream consumers can utilize it for consistency as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like an implementation detail of the DefaultObjectStoreRegistry
to me 🤔 so I am not sure about making it public
If it needs to be public, perhaps it could a method on DefaultObjectStoreRegistry
like:
impl DefaultObjectStoreRegistry {
pub fn url_key(url: &Url) -> String { ... }
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, IMO it should be private since it's just used for the default registry. Other registries might want to cache differently.
Yes, the default implementation relies on pre-registering stores. Originally DF also had factory abstractions but this became increasingly unwieldy, and their interactions hard to reason about - especially once keying on more than just scheme and host. This was resolved by extracting the ObjectStoreRegistry as a trait, allowing people full control over how to convert URLs to stores, cached or otherwise. R.e. opts, I'm not sure it makes sense to expose this in the trait, rather I'd expect alternativr implementations of ObjectStoreRegistry to hook up config as appropriate for the use-case. TLDR ObjectStoreRegistry is designed to be a common interface more than a batteries included implementation |
Yep, that was my initial feeling as well. It seemed weird to provide opts in the API calls. If users want to use
@kylebarron yes and no. The registry contains initialized/configured |
This reverts commit 658ba5e.
@kylebarron Looking at the delta-rs code a bit more. It seems like the In particular /// Simpler access pattern for the [ObjectStoreFactoryRegistry] to get a single store
pub fn store_for<K, V, I>(url: &Url, options: I) -> DeltaResult<ObjectStoreRef>
where
I: IntoIterator<Item = (K, V)>,
K: AsRef<str> + Into<String>,
V: AsRef<str> + Into<String>,
{
if let Some(store) = object_store_registry.get_store(&url) {
return store;
} else { // TODO would need to validate scheme is memory/url
let (store, _) = object_store::parse_url_opts(url, options);
Ok(Arc::new(store))
} else {
Err(DeltaTableError::InvalidTableLocation(url.clone().into()))
}
} Admittedly, I didn't dig into the storage config stuff. Just sort of spit-balling to pressure test this PR's API. |
@alamb @tustvold Ok, I think this is ready for (another) review. I've added a few implementations of
Please take a pass and let me know how things look. A few notes for reviewers:
|
I'll try to find some time to review this at the weekend, however, my initial reaction is that this has gotten a lot more complicated than I expected. For example, IMO if people want to be providing custom parser_fn they should just implement the trait... I'm also not sure about some of the additions to the trait. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick pass, I'd like to suggest removing the new trait methods as they have quite confusing semantics and result in a very leaky abstraction as this is a lossy translation.
As stated above, implementations should only "materialize" the URL to the store at the point of usage, with plans etc... in terms of URLs not Arc<dyn ObjextStore>
avoiding the need for this functionality.
I think this will also remove the need for the additional impls, as the trait is now trivial for people to just implement as required for their use-case.
I think we should just provide a DefaultObjectStoreRegistry that uses the parse_url machinery with options sourced from the environment. (It may have to use the lower-level ObjectStoreScheme to determine the correct environment prefix to use).
src/registry.rs
Outdated
fn get_store(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>; | ||
|
||
/// Given one of the `Arc<dyn ObjectStore>`s you registered, return its URL. | ||
fn get_prefix(&self, store: Arc<dyn ObjectStore>) -> Option<Url>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a massive fan of this interface, it is hard for stores to implement it efficiently, and is just a bit peculiar.
The pattern used in DF instead is to pass around the URLs and only "materialize" the store at the point of usage
src/registry.rs
Outdated
fn get_prefix(&self, store: Arc<dyn ObjectStore>) -> Option<Url>; | ||
|
||
/// List all registered store prefixes | ||
fn get_store_prefixes(&self) -> Vec<Url>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above, I'm not sure about this method. It isn't well defined what this means, as it is up to the implementation how it wishes to key stores, which makes this abstraction leaky
src/registry.rs
Outdated
/// a URL prefix depends on the [`ObjectStoreRegistry`] implementation. See implementation docs | ||
/// for more details. | ||
pub trait ObjectStoreRegistry: Send + Sync + std::fmt::Debug + 'static { | ||
/// Register a new store for any URL that begins with `prefix` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has subtly changed the behaviour from DF, which delegates it to the registry implementation to determine the keying portion of the URL. I think this needs to be changed back, it was that way for a reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs changes prior to merge, in particular it makes changes to the way that URLs behave that would make this abstraction not usable by DF. The goal of ObjectStoreRegistry is effectively to act as cache for parse_url, it is not intended for implementations to be keyed on the full path to an object - rather some implementation-defined subset of that URL.
src/registry.rs
Outdated
/// If no [`ObjectStore`] is found for the `url`, an [`ObjectStore`] may be lazily be | ||
/// created and registered. The logic for doing so is left to each [`ObjectStoreRegistry`] | ||
/// implementation. | ||
fn get_store(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn get_store(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>; | |
fn get_store(&self, url: &Url) -> Result<(Arc<dyn ObjectStore>, Path)>; |
Perhaps, to mirror parse_url and allow for more complex relationships between URLs and paths within stores.
Having had a play, you successfully nerd-sniped me, I've filed #356 as I think that will help implement this - in particular defining what portion of a URL refers to the store and which to the object within the store. |
We do have this actually: https://github.com/delta-io/delta-rs/blob/3d2bbef5cb7c5f6b7e5d1b778edbfd098e993bc3/crates/core/src/logstore/storage/mod.rs#L23 (I grabbed that from datafusion codebase some time ago for the lakefs integration), so good to see it will be part of object-store :) |
🔥 I will update the PR to simplify things. I suspected this might be the outcome, but I wanted to experiment/pressure test the APIs anyway. :) |
@tustvold Ok, I think I've incorporated all of your feedback.
Two other things:
|
I have an alternate proposal here, PTAL - #375 |
Closing in favor of #375 |
This change introduces an
ObjectStoreRegistry
. The rationale is that there are some cases whereObjectStore
s need to be resolved at runtime. There already existsparse_url
to address this need, but it's fairly hard-coded. It doesn't support customObjectStore
implementations and parsing logic, for example.ObjectStoreRegistry
can also act as a nice cache when multiple object stores are used across the codebase. Moreover, it can serve as a way to map URLs to object stores--something that was requested in #347 and #259. Sinceobject_store
does not currently expose aget_url()
method, this is useful in cases where a user wants to go from anObjectStore
back to a URL.The code in this PR starts from DataFusion's
ObjectStoreRegistry
. The following changes are then made:ObjectStoreUrl
list_urls
methodparse_url
to create anObjectStore
onget_store
cache miss (if possible)Fixes #347