-
Notifications
You must be signed in to change notification settings - Fork 76
Add ObjectStoreRegistry (#347) #375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
( | ||
"s3://bucket/foo bar", | ||
(ObjectStoreScheme::AmazonS3, "foo bar"), | ||
), | ||
("s3://bucket/😀", (ObjectStoreScheme::AmazonS3, "😀")), | ||
( | ||
"s3://bucket/%F0%9F%98%80", | ||
(ObjectStoreScheme::AmazonS3, "😀"), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unrelated test I wrote whilst playing around
|
||
/// Resolve an object URL | ||
/// | ||
/// If [`ObjectStoreRegistry::register`] has been called with a URL with the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original formulation in DF simply uses scheme and host. This works well for bucket specific URLs like s3://bucket/path
but it falls apart for URLs like https://s3.region.amazonaws.com/bucket
.
The challenge is to come up with a mechanism to support keying on more than scheme and host, without requiring the registry to actually understand the components of that URL. This is the formulation I came up with.
Unlike #356 it avoids the store needing to actually understand the content of the URL, making this significantly more flexible.
6c7b5b7
to
7f64cba
Compare
src/registry.rs
Outdated
.range::<str, _>((Bound::Included(start), Bound::Unbounded)) | ||
.take_while(|&(base, _)| &base.0[..url::Position::BeforePath] == start); | ||
|
||
let mut longest_len = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the downside of this formulation, we have to use a BTreeMap which is slower than a HashMap, and if there are multiple (scheme, host) matches we have to scan through them.
I actually plan to tweak this to do path matching on a path segment basis |
/// Lookup a store based on URL path | ||
/// | ||
/// Returns the store and its path segment depth | ||
fn lookup(&self, to_resolve: &Url) -> Option<(&Arc<dyn ObjectStore>, usize)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does mean that if people register really long paths, this process will be rather inefficient, however, in most cases I think this should be acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks excellent. I think it will work for our SlateDB use case!
src/registry.rs
Outdated
/// Resolve an object URL | ||
/// | ||
/// If [`ObjectStoreRegistry::register`] has been called with a URL with the same | ||
/// scheme and host as the object URL, and a path that is a prefix of the object URL's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: scheme/host/port, right?
src/registry.rs
Outdated
/// | ||
/// If [`ObjectStoreRegistry::register`] has been called with a URL with the same | ||
/// scheme and host as the object URL, and a path that is a prefix of the object URL's | ||
/// it should be returned with along with the trailing path. Paths should be matched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// it should be returned with along with the trailing path. Paths should be matched | |
/// it should be returned along with the trailing path. Paths should be matched |
/// assert_eq!(path.as_ref(), "path/to/object"); | ||
/// assert!(Arc::ptr_eq(&ret, &bucket2)); | ||
/// | ||
/// let bucket3 = Arc::new(PrefixStore::new(InMemory::new(), "path")) as Arc<dyn ObjectStore>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Niiiiiice
src/registry.rs
Outdated
for segments in path_segments(url.path()) { | ||
entry = entry.children.entry(segments.to_string()).or_default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for segments in path_segments(url.path()) { | |
entry = entry.children.entry(segments.to_string()).or_default(); | |
for segment in path_segments(url.path()) { | |
entry = entry.children.entry(segment.to_string()).or_default(); |
children: HashMap<String, Self>, | ||
} | ||
|
||
impl PathEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Some docs on how this works would be helpful. It's pretty slick, but it took me a few mins to wrap my head around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some docs
if let Some(store) = ¤t.store { | ||
ret = Some((store, depth)) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be an else break here (or a while loop instead)? Once an entry doesn't have a match for the segment, I think we're done, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the case of the following stores
memory:///
memory:///1/2/3
And resolving memory:///1/2/3/4
, if we broke we would return the wrong answer.
I will add this as a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, I'm not saying when we find a child with a store. I'm saying when we find that there's no child for the given segment. Doesn't this code iterate over path segments? So can't we break as soon as we find that a path entry has no child for the segment?
In the example you showed above, we would break since 4 isn't a child of 3 and we'd return the proper store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes we can 👍
(This actually was a bug, nice spot)
} | ||
} | ||
|
||
if let Ok((store, path)) = parse_url_opts(to_resolve, std::env::vars()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤩
if let Ok((store, path)) = parse_url_opts(to_resolve, std::env::vars()) { | ||
let depth = num_segments(to_resolve.path()) - num_segments(path.as_ref()); | ||
|
||
let mut map = self.map.write(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to re-use the logic in register()
here? Seems like duplicate logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I forgot to handle the race which is why this code is separate
} | ||
|
||
/// Returns the non-empty segments of a path | ||
fn path_segments(s: &str) -> impl Iterator<Item = &str> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious about the choice to implement this vs Url::path_segments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just want the non-empty path segments (as Path strips them)
|
||
/// Extracts the scheme and authority of a URL (components before the Path) | ||
fn url_key(url: &Url) -> &str { | ||
&url[..url::Position::AfterPort] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be noted this will consider credentials unlike the version in DF - IMO this keeps things simple, and ultimately people shouldn't really be putting credentials in URLs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just makes me nervous because what people should do and what they will do are two different things. I worry this will lead to leaked credentials and security incidents. Could we at least .warn() if we see it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so imagine we do mask out passwords, now imagine a user fat fingers the password the first time, they then try again with the fixed password, but because the URL has been cached with the incorrect password it will continue to use the old version.
Whilst I accept passwords in URLs are bad, I also can't help feeling masking them introduces somewhat surprising behaviour for relatively limited security improvement - ultimately the user is typing a credential into a SQL query or similar, it's gonna get logged somewhere 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suggesting leaving the code as you have it but calling a WARN telling the user they shouldn't be doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO a utility crate like object_store is the wrong level for such a warning to be printed, we will get people complaining about it and then wanting to disable it, e.g. if they're user provided URLs that are being used or something... I'd expect such a warning to be implemented as part of whatever frontend users are interacting with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given this is a trait too people can potentially implement their own version if they have special needs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me. I am happy to have @criccomini 's review (made my review strightforward)
Thank you @tustvold and @criccomini
} | ||
} | ||
|
||
/// An [`ObjectStoreRegistry`] that uses [`parse_url_opts`] to create stores based on the environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please document what "based on the environment" means here? Ideally with an example
Basically point out that it will try and instantiate one of the built in stores and pass std::env::vars
as options.
I also tried to improve the docs on parse_url_opts
in a separate PR:
|
||
/// Extracts the scheme and authority of a URL (components before the Path) | ||
fn url_key(url: &Url) -> &str { | ||
&url[..url::Position::AfterPort] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given this is a trait too people can potentially implement their own version if they have special needs
Which issue does this PR close?
ObjectStore
to(URL, opts)
? #347ObjectStoreUrl
to resolve URLs toObjectStore
instances #356Rationale for this change
See tickets, but the TLDR is we want to provide a flexible way to allow people to register and manage ObjectStore based on URL.
What changes are included in this PR?
Adds an ObjectStoreRegistry and a default implementation
Are there any user-facing changes?