Skip to content

Add ObjectStoreRegistry #348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

criccomini
Copy link
Contributor

@criccomini criccomini commented May 6, 2025

This change introduces an ObjectStoreRegistry. The rationale is that there are some cases where ObjectStores need to be resolved at runtime. There already exists parse_url to address this need, but it's fairly hard-coded. It doesn't support custom ObjectStore implementations and parsing logic, for example.

ObjectStoreRegistry can also act as a nice cache when multiple object stores are used across the codebase. Moreover, it can serve as a way to map URLs to object stores--something that was requested in #347 and #259. Since object_store does not currently expose a get_url() method, this is useful in cases where a user wants to go from an ObjectStore back to a URL.

The code in this PR starts from DataFusion's ObjectStoreRegistry. The following changes are then made:

  1. Remove ObjectStoreUrl
  2. Add list_urls method
  3. Use parse_url to create an ObjectStore on get_store cache miss (if possible)
  4. Adds registry tests
  5. Minor doc updates

Fixes #347

@criccomini
Copy link
Contributor Author

criccomini commented May 6, 2025

Couple of open questions:

  • How do y'all feel about the docs right now? They're just copied from DF.
  • WDYT about adding a get_url method that takes an object store and compares refs to find the URL for it in the registry.
  • I used dashmap since that's what DF was using. Not sure how you feel about adding the dependency.
  • One of the tests uses a clunk debug fmt thingy to verify the returned instance is of type LocalFileSystem. I couldn't figure out the right way to do this in rust. o4-mini-high failed me.
  • I'm thinking we should have get_url_key be public so that other registry implementations can use it.

@criccomini
Copy link
Contributor Author

Looks like we would need to drop dashmap to 5.5.2 (https://github.com/xacrimon/dashmap/blob/v5.5.2/Cargo.toml) to maintain the MSRV you guys have set (1.64). 6.1.0 requires 1.6.5 rust.

@kylebarron
Copy link
Contributor

Rust 1.65 is 2.5 years old; I don't have a problem with setting that as MSRV (though I don't have final say)

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we gate this by a feature flag given the additional dependency.

I think we also probably want to see how DF might switch to using this formulation, perhaps file a DF issue requesting help if you don't have capacity? I think we should make sure what we're adding here can be used downstream.

Edit: tbh regarding dashmap I'd be inclined to just use RWLock around a HashMap, this is not a contended piece of code.

@rtyler
Copy link
Contributor

rtyler commented May 7, 2025

Golly this code looks familiar! I don't know if this was inspired by what we do in delta-rs or if this is convergent evolution 🦀 !

Either way, I think this is a great improvement and the shedding of the dashmap dependency shouldn't be necessary based on the code paths I have seen using our implementation of the same concept. I think this would be a positive change and allow me to shed some code as well 😄

@criccomini
Copy link
Contributor Author

I've replaced dashmap with an RWLock on a HashMap here: 81a05b0

@criccomini
Copy link
Contributor Author

@rtyler Does this PR look like it would meet your needs (i.e. that you could remove your own registry)?

@criccomini
Copy link
Contributor Author

I opened a DataFusion github issue to ask if they'll adopt this version: apache/datafusion#15983

@rtyler
Copy link
Contributor

rtyler commented May 7, 2025

@criccomini I think the approach here could, but as written thus far it only could replace use of parse_url whereas we're pretty heavily reliant on parse_url_opts, since object store options need to be plumbed through. I think that minor extension to the trait would make our factories code redundant

@criccomini
Copy link
Contributor Author

criccomini commented May 7, 2025

@rtyler Gotcha. I was a bit hesitant to add _opts support since I figured folks would instead implement their own ObjectStoreRegistry rather than use the default. Users can also call parse_url_opts, and pass the store in via register_store.

It's easy enough to add, though. And since _opts is part of the object_store API, I suppose it's not that bad to support. It just felt funny to me since you have to supply the opts every time (unless you know it's a store is already cached for a URL, and decide to skip passing in the opts).

@criccomini
Copy link
Contributor Author

criccomini commented May 7, 2025

@rtyler I've pushed an update that contains a get_store_with_opts method. The default implementation passes the opts to parse_url_opts. get_store calls get_store_with_opts with an empty iterator now.

LMK if this meets your needs.

Copy link
Contributor

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was misunderstanding how the client usage here might look, the registry contains initialized and configured ObjectStoreRefs based on the provided Url. I was crossing the streams in my brain and thinking about the Url scheme too much.

Upon further consideration this would be a bit more limiting than I initially considered. As such it would not replace the factories within delta-rs since there we are needing to register a factory/closure/whatever-you-wanna-call-it which will produce those ObjectStoreRef objects rather than caching initialized refs in some global state.

I'm pondering this a bit more but the visage of an API isn't coming to me yet 🤔

src/registry.rs Outdated

/// Get the key of a url for object store registration.
/// The credential info will be removed
fn get_url_key(url: &Url) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend making this a public URL so that downstream consumers can utilize it for consistency as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like an implementation detail of the DefaultObjectStoreRegistry to me 🤔 so I am not sure about making it public

If it needs to be public, perhaps it could a method on DefaultObjectStoreRegistry like:

impl DefaultObjectStoreRegistry {
  pub fn url_key(url: &Url) -> String { ... }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, IMO it should be private since it's just used for the default registry. Other registries might want to cache differently.

@tustvold
Copy link
Contributor

tustvold commented May 7, 2025

Yes, the default implementation relies on pre-registering stores. Originally DF also had factory abstractions but this became increasingly unwieldy, and their interactions hard to reason about - especially once keying on more than just scheme and host. This was resolved by extracting the ObjectStoreRegistry as a trait, allowing people full control over how to convert URLs to stores, cached or otherwise.

R.e. opts, I'm not sure it makes sense to expose this in the trait, rather I'd expect alternativr implementations of ObjectStoreRegistry to hook up config as appropriate for the use-case.

TLDR ObjectStoreRegistry is designed to be a common interface more than a batteries included implementation

@criccomini
Copy link
Contributor Author

criccomini commented May 8, 2025

R.e. opts, I'm not sure it makes sense to expose this in the trait, rather I'd expect alternativr implementations of ObjectStoreRegistry to hook up config as appropriate for the use-case.

Yep, that was my initial feeling as well. It seemed weird to provide opts in the API calls. If users want to use parse_url_opts, that's fine; they can either call register_store with the ObjectStore they created from parse_url_opts. Though I suppose going that route changes the registry into more of a cache.

the registry contains initialized and configured ObjectStoreRefs based on the provided Url

@kylebarron yes and no. The registry contains initialized/configured ObjectStoreRefs. It does not necessarily require them to be generated from the Url since you may register the ObjectStore directly using register_store. That method allows you to initialize the stores however you want (for example, using parse_url_opts). The default implementation simply uses parse_url.

@criccomini
Copy link
Contributor Author

@kylebarron Looking at the delta-rs code a bit more. It seems like the ObjectStoreFactoryRegistry stuff is perhaps a bit closer to what the ObjectStoreRegistry does in this PR:

https://github.com/delta-io/delta-rs/blob/3d2bbef5cb7c5f6b7e5d1b778edbfd098e993bc3/crates/core/src/logstore/factories.rs#L64-L94

In particular store_for looks pretty close. I wonder if something like this would work:

/// Simpler access pattern for the [ObjectStoreFactoryRegistry] to get a single store
pub fn store_for<K, V, I>(url: &Url, options: I) -> DeltaResult<ObjectStoreRef>
where
    I: IntoIterator<Item = (K, V)>,
    K: AsRef<str> + Into<String>,
    V: AsRef<str> + Into<String>,
{
    if let Some(store) = object_store_registry.get_store(&url) {
        return store;
    } else { // TODO would need to validate scheme is memory/url
        let (store, _) = object_store::parse_url_opts(url, options);
        Ok(Arc::new(store))
    } else {
        Err(DeltaTableError::InvalidTableLocation(url.clone().into()))
    }
}

Admittedly, I didn't dig into the storage config stuff. Just sort of spit-balling to pressure test this PR's API.

@criccomini
Copy link
Contributor Author

criccomini commented May 13, 2025

@alamb @tustvold Ok, I think this is ready for (another) review. I've added a few implementations of ObjectStoreRegistry that (I think) should get most users what they want. There are now three:

  • DefaultObjectStoreRegistry: A very dumb registry that simply treats URLs in get_store calls as the prefix.
  • PrefixObjectStoreRegistry: A registry that wraps DefaultObjectStoreRegistry and calls a prefix_fn on all get_store URLs. The default implementation uses DataFusion's get_url_key implementation style.
  • ParserObjectStoreRegistry: A registry that wraps PrefixObjectStoreRegistry and calls a parser_fn on any get_store calls that aren't set in the wrapped (prefix) registry. The parser_fn defaults to using the parse_url method. If the parser_fn returns an object store, it's cached using the prefix that was resolved in get_store. If a user wants to use custom opts for different URL prefixes, they can supply their own parser_fn. See the tests and docs for examples that use parse_url_opts. This should address @rtyler's use case; it's also something we'll need for SlateDB.

Please take a pass and let me know how things look. A few notes for reviewers:

  1. I'm not 100% on my Rust foo, so pay attention to silly things I might have done with Boxes, Arcs, and so on.
  2. If y'all feel like this is too much, let me know. I can remove the prefix and/or parser implementations.
  3. I found that test_url_http wasn't working as expected when I copied its code over to test the ParserObjectStoreRegistry on an http URL. I decided to fix the test as part of this PR.
  4. Seems like there's a test failure on some transient test for aws. Shouldn't be affected by my code.

@tustvold
Copy link
Contributor

I'll try to find some time to review this at the weekend, however, my initial reaction is that this has gotten a lot more complicated than I expected.

For example, IMO if people want to be providing custom parser_fn they should just implement the trait...

I'm also not sure about some of the additions to the trait.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick pass, I'd like to suggest removing the new trait methods as they have quite confusing semantics and result in a very leaky abstraction as this is a lossy translation.

As stated above, implementations should only "materialize" the URL to the store at the point of usage, with plans etc... in terms of URLs not Arc<dyn ObjextStore> avoiding the need for this functionality.

I think this will also remove the need for the additional impls, as the trait is now trivial for people to just implement as required for their use-case.

I think we should just provide a DefaultObjectStoreRegistry that uses the parse_url machinery with options sourced from the environment. (It may have to use the lower-level ObjectStoreScheme to determine the correct environment prefix to use).

src/registry.rs Outdated
fn get_store(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>;

/// Given one of the `Arc<dyn ObjectStore>`s you registered, return its URL.
fn get_prefix(&self, store: Arc<dyn ObjectStore>) -> Option<Url>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a massive fan of this interface, it is hard for stores to implement it efficiently, and is just a bit peculiar.

The pattern used in DF instead is to pass around the URLs and only "materialize" the store at the point of usage

src/registry.rs Outdated
fn get_prefix(&self, store: Arc<dyn ObjectStore>) -> Option<Url>;

/// List all registered store prefixes
fn get_store_prefixes(&self) -> Vec<Url>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above, I'm not sure about this method. It isn't well defined what this means, as it is up to the implementation how it wishes to key stores, which makes this abstraction leaky

src/registry.rs Outdated
/// a URL prefix depends on the [`ObjectStoreRegistry`] implementation. See implementation docs
/// for more details.
pub trait ObjectStoreRegistry: Send + Sync + std::fmt::Debug + 'static {
/// Register a new store for any URL that begins with `prefix`
Copy link
Contributor

@tustvold tustvold May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has subtly changed the behaviour from DF, which delegates it to the registry implementation to determine the keying portion of the URL. I think this needs to be changed back, it was that way for a reason.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs changes prior to merge, in particular it makes changes to the way that URLs behave that would make this abstraction not usable by DF. The goal of ObjectStoreRegistry is effectively to act as cache for parse_url, it is not intended for implementations to be keyed on the full path to an object - rather some implementation-defined subset of that URL.

src/registry.rs Outdated
/// If no [`ObjectStore`] is found for the `url`, an [`ObjectStore`] may be lazily be
/// created and registered. The logic for doing so is left to each [`ObjectStoreRegistry`]
/// implementation.
fn get_store(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn get_store(&self, url: &Url) -> Option<Arc<dyn ObjectStore>>;
fn get_store(&self, url: &Url) -> Result<(Arc<dyn ObjectStore>, Path)>;

Perhaps, to mirror parse_url and allow for more complex relationships between URLs and paths within stores.

@tustvold
Copy link
Contributor

Having had a play, you successfully nerd-sniped me, I've filed #356 as I think that will help implement this - in particular defining what portion of a URL refers to the store and which to the object within the store.

@ion-elgreco
Copy link
Contributor

I think I was misunderstanding how the client usage here might look, the registry contains initialized and configured ObjectStoreRefs based on the provided Url. I was crossing the streams in my brain and thinking about the Url scheme too much.

Upon further consideration this would be a bit more limiting than I initially considered. As such it would not replace the factories within delta-rs since there we are needing to register a factory/closure/whatever-you-wanna-call-it which will produce those ObjectStoreRef objects rather than caching initialized refs in some global state.

I'm pondering this a bit more but the visage of an API isn't coming to me yet 🤔

We do have this actually: https://github.com/delta-io/delta-rs/blob/3d2bbef5cb7c5f6b7e5d1b778edbfd098e993bc3/crates/core/src/logstore/storage/mod.rs#L23 (I grabbed that from datafusion codebase some time ago for the lakefs integration), so good to see it will be part of object-store :)

@criccomini
Copy link
Contributor Author

🔥 I will update the PR to simplify things. I suspected this might be the outcome, but I wanted to experiment/pressure test the APIs anyway. :)

@criccomini
Copy link
Contributor Author

@tustvold Ok, I think I've incorporated all of your feedback.

  • Removed get_store_url
  • Removed PrefixObjectStoreRegistry/ParserObjectStoreRegistry
  • Updated DefaultObjectStoreRegistry to use parse_opts and added a map_url_to_key method.

Two other things:

  1. The map_url_to_key adheres to similar rules to ObjectStoreScheme::parse to make it work with memory/file schemes. This seemed right to me for the time being. Once Add ObjectStoreUrl to resolve URLs to ObjectStore instances #356 is done, my code should probably use ObjectStoreScheme::parse directly.
  2. I did not incorporate any of the environment parsing stuff you mentioned in Add ObjectStoreRegistry #348 (review). TBH, I feel that parse_url_opts is a better place for that (rather than using std::iter::empty::<(&str, &str)>() when it invokes parse_url_opts).

@criccomini criccomini mentioned this pull request May 14, 2025
@tustvold
Copy link
Contributor

I have an alternate proposal here, PTAL - #375

@criccomini
Copy link
Contributor Author

Closing in favor of #375

@criccomini criccomini closed this May 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is there a way to go from ObjectStore to (URL, opts)?
6 participants