Skip to content

Add ObjectStoreRegistry (#347) #375

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 25, 2025

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented May 17, 2025

Which issue does this PR close?

Rationale for this change

See tickets, but the TLDR is we want to provide a flexible way to allow people to register and manage ObjectStore based on URL.

What changes are included in this PR?

Adds an ObjectStoreRegistry and a default implementation

Are there any user-facing changes?

Comment on lines +289 to +297
(
"s3://bucket/foo bar",
(ObjectStoreScheme::AmazonS3, "foo bar"),
),
("s3://bucket/😀", (ObjectStoreScheme::AmazonS3, "😀")),
(
"s3://bucket/%F0%9F%98%80",
(ObjectStoreScheme::AmazonS3, "😀"),
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated test I wrote whilst playing around


/// Resolve an object URL
///
/// If [`ObjectStoreRegistry::register`] has been called with a URL with the same
Copy link
Contributor Author

@tustvold tustvold May 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original formulation in DF simply uses scheme and host. This works well for bucket specific URLs like s3://bucket/path but it falls apart for URLs like https://s3.region.amazonaws.com/bucket.

The challenge is to come up with a mechanism to support keying on more than scheme and host, without requiring the registry to actually understand the components of that URL. This is the formulation I came up with.

Unlike #356 it avoids the store needing to actually understand the content of the URL, making this significantly more flexible.

@tustvold tustvold force-pushed the add-object-store-registry branch from 6c7b5b7 to 7f64cba Compare May 17, 2025 18:48
src/registry.rs Outdated
.range::<str, _>((Bound::Included(start), Bound::Unbounded))
.take_while(|&(base, _)| &base.0[..url::Position::BeforePath] == start);

let mut longest_len = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the downside of this formulation, we have to use a BTreeMap which is slower than a HashMap, and if there are multiple (scheme, host) matches we have to scan through them.

@tustvold
Copy link
Contributor Author

I actually plan to tweak this to do path matching on a path segment basis

@tustvold tustvold marked this pull request as draft May 17, 2025 19:02
@tustvold tustvold marked this pull request as ready for review May 17, 2025 20:59
/// Lookup a store based on URL path
///
/// Returns the store and its path segment depth
fn lookup(&self, to_resolve: &Url) -> Option<(&Arc<dyn ObjectStore>, usize)> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does mean that if people register really long paths, this process will be rather inefficient, however, in most cases I think this should be acceptable.

@tustvold tustvold mentioned this pull request May 17, 2025
Copy link
Contributor

@criccomini criccomini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks excellent. I think it will work for our SlateDB use case!

src/registry.rs Outdated
/// Resolve an object URL
///
/// If [`ObjectStoreRegistry::register`] has been called with a URL with the same
/// scheme and host as the object URL, and a path that is a prefix of the object URL's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: scheme/host/port, right?

src/registry.rs Outdated
///
/// If [`ObjectStoreRegistry::register`] has been called with a URL with the same
/// scheme and host as the object URL, and a path that is a prefix of the object URL's
/// it should be returned with along with the trailing path. Paths should be matched
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// it should be returned with along with the trailing path. Paths should be matched
/// it should be returned along with the trailing path. Paths should be matched

/// assert_eq!(path.as_ref(), "path/to/object");
/// assert!(Arc::ptr_eq(&ret, &bucket2));
///
/// let bucket3 = Arc::new(PrefixStore::new(InMemory::new(), "path")) as Arc<dyn ObjectStore>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiiiiice

src/registry.rs Outdated
Comment on lines 153 to 154
for segments in path_segments(url.path()) {
entry = entry.children.entry(segments.to_string()).or_default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for segments in path_segments(url.path()) {
entry = entry.children.entry(segments.to_string()).or_default();
for segment in path_segments(url.path()) {
entry = entry.children.entry(segment.to_string()).or_default();

children: HashMap<String, Self>,
}

impl PathEntry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Some docs on how this works would be helpful. It's pretty slick, but it took me a few mins to wrap my head around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some docs

if let Some(store) = &current.store {
ret = Some((store, depth))
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be an else break here (or a while loop instead)? Once an entry doesn't have a match for the segment, I think we're done, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the case of the following stores

  • memory:///
  • memory:///1/2/3

And resolving memory:///1/2/3/4, if we broke we would return the wrong answer.

I will add this as a test

Copy link
Contributor

@criccomini criccomini May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, I'm not saying when we find a child with a store. I'm saying when we find that there's no child for the given segment. Doesn't this code iterate over path segments? So can't we break as soon as we find that a path entry has no child for the segment?

In the example you showed above, we would break since 4 isn't a child of 3 and we'd return the proper store.

Copy link
Contributor Author

@tustvold tustvold May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes we can 👍

(This actually was a bug, nice spot)

}
}

if let Ok((store, path)) = parse_url_opts(to_resolve, std::env::vars()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤩

if let Ok((store, path)) = parse_url_opts(to_resolve, std::env::vars()) {
let depth = num_segments(to_resolve.path()) - num_segments(path.as_ref());

let mut map = self.map.write();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to re-use the logic in register() here? Seems like duplicate logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I forgot to handle the race which is why this code is separate

}

/// Returns the non-empty segments of a path
fn path_segments(s: &str) -> impl Iterator<Item = &str> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about the choice to implement this vs Url::path_segments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just want the non-empty path segments (as Path strips them)


/// Extracts the scheme and authority of a URL (components before the Path)
fn url_key(url: &Url) -> &str {
&url[..url::Position::AfterPort]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be noted this will consider credentials unlike the version in DF - IMO this keeps things simple, and ultimately people shouldn't really be putting credentials in URLs

Copy link
Contributor

@criccomini criccomini May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just makes me nervous because what people should do and what they will do are two different things. I worry this will lead to leaked credentials and security incidents. Could we at least .warn() if we see it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so imagine we do mask out passwords, now imagine a user fat fingers the password the first time, they then try again with the fixed password, but because the URL has been cached with the incorrect password it will continue to use the old version.

Whilst I accept passwords in URLs are bad, I also can't help feeling masking them introduces somewhat surprising behaviour for relatively limited security improvement - ultimately the user is typing a credential into a SQL query or similar, it's gonna get logged somewhere 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting leaving the code as you have it but calling a WARN telling the user they shouldn't be doing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO a utility crate like object_store is the wrong level for such a warning to be printed, we will get people complaining about it and then wanting to disable it, e.g. if they're user provided URLs that are being used or something... I'd expect such a warning to be implemented as part of whatever frontend users are interacting with

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given this is a trait too people can potentially implement their own version if they have special needs

Copy link
Contributor

@criccomini criccomini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: looks good!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me. I am happy to have @criccomini 's review (made my review strightforward)

Thank you @tustvold and @criccomini

}
}

/// An [`ObjectStoreRegistry`] that uses [`parse_url_opts`] to create stores based on the environment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please document what "based on the environment" means here? Ideally with an example

Basically point out that it will try and instantiate one of the built in stores and pass std::env::vars as options.

I also tried to improve the docs on parse_url_opts in a separate PR:


/// Extracts the scheme and authority of a URL (components before the Path)
fn url_key(url: &Url) -> &str {
&url[..url::Position::AfterPort]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given this is a trait too people can potentially implement their own version if they have special needs

@tustvold tustvold merged commit 72088de into apache:main May 25, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ObjectStoreUrl to resolve URLs to ObjectStore instances Is there a way to go from ObjectStore to (URL, opts)?
3 participants