Add ObjectStoreRegistry (#347) #375

tustvold · 2025-05-17T18:35:42Z

Which issue does this PR close?

Rationale for this change

See tickets, but the TLDR is that we want to provide a flexible way for people to register and manage ObjectStore instances based on URL.

What changes are included in this PR?

Adds an ObjectStoreRegistry and a default implementation

Are there any user-facing changes?

tustvold · 2025-05-17T18:35:55Z

src/parse.rs

+            (
+                "s3://bucket/foo bar",
+                (ObjectStoreScheme::AmazonS3, "foo bar"),
+            ),
+            ("s3://bucket/😀", (ObjectStoreScheme::AmazonS3, "😀")),
+            (
+                "s3://bucket/%F0%9F%98%80",
+                (ObjectStoreScheme::AmazonS3, "😀"),
+            ),


This is an unrelated test I wrote whilst playing around

tustvold · 2025-05-17T18:37:47Z

src/registry.rs

+
+    /// Resolve an object URL
+    ///
+    /// If [`ObjectStoreRegistry::register`] has been called with a URL with the same


The original formulation in DataFusion (DF) simply uses scheme and host. This works well for bucket-specific URLs like s3://bucket/path, but it falls apart for URLs like https://s3.region.amazonaws.com/bucket, where the bucket is part of the path rather than the host.

The challenge is to come up with a mechanism to support keying on more than scheme and host, without requiring the registry to actually understand the components of that URL. This is the formulation I came up with.

Unlike #356 it avoids the store needing to actually understand the content of the URL, making this significantly more flexible.
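To make the collision concrete, here is a minimal sketch using only the standard library. The helper `scheme_and_authority` is hypothetical and uses naive string slicing as a stand-in for real URL parsing; it is not the code in this PR.

```rust
/// Returns everything before the path, e.g. `s3://bucket` for `s3://bucket/path`.
/// Hypothetical helper: naive slicing, not a real URL parser.
fn scheme_and_authority(url: &str) -> Option<&str> {
    let authority_start = url.find("://")? + "://".len();
    // The authority ends at the first '/' after "://", or at the end of the string.
    let end = url[authority_start..]
        .find('/')
        .map_or(url.len(), |offset| authority_start + offset);
    Some(&url[..end])
}

/// True if two URLs would share a registry entry when keyed on scheme and host only.
fn collide(a: &str, b: &str) -> bool {
    scheme_and_authority(a) == scheme_and_authority(b)
}
```

With bucket-specific URLs, `s3://bucket1/a` and `s3://bucket2/b` get different keys, but `https://s3.region.amazonaws.com/bucket1/a` and `https://s3.region.amazonaws.com/bucket2/b` collide, which is why the key needs to be able to include part of the path.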

tustvold · 2025-05-17T18:51:55Z

src/registry.rs

+                .range::<str, _>((Bound::Included(start), Bound::Unbounded))
+                .take_while(|&(base, _)| &base.0[..url::Position::BeforePath] == start);
+
+            let mut longest_len = 0;


The downside of this formulation is that we have to use a BTreeMap, which is slower than a HashMap, and if there are multiple (scheme, host) matches we have to scan through them.
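For readers unfamiliar with the pattern, a rough sketch of the range scan follows. It assumes the map is keyed by the registered base URL as a plain `String` and that stores are just labels; `longest_match` is a hypothetical name, and the matching here is plain string-prefix matching rather than the exact logic in the diff.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Scans every entry sharing `authority` (scheme and host) and returns the
/// store registered under the longest base URL that prefixes `url`.
fn longest_match<'a>(
    map: &'a BTreeMap<String, String>,
    authority: &str,
    url: &str,
) -> Option<&'a str> {
    map.range::<str, _>((Bound::Included(authority), Bound::Unbounded))
        // Keys are sorted, so all candidates for this authority are contiguous.
        .take_while(|(base, _)| base.starts_with(authority))
        // Keep only the bases that actually prefix the URL being resolved.
        .filter(|(base, _)| url.starts_with(base.as_str()))
        .max_by_key(|(base, _)| base.len())
        .map(|(_, store)| store.as_str())
}
```

The sorted order is what makes the scan possible at all: a HashMap could find an exact key quickly but has no way to enumerate "all keys starting with this scheme and host".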

tustvold · 2025-05-17T19:02:13Z

I actually plan to tweak this to do path matching on a path segment basis
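One likely motivation for matching per path segment, sketched with the standard library only: a plain string prefix treats `/foo` as a prefix of `/foobar/x`, whereas a segment-wise comparison does not. `is_segment_prefix` is a hypothetical helper for illustration, not a function in this PR.

```rust
/// Non-empty segments of a path, so "/a//b/" yields "a" and "b".
fn path_segments(s: &str) -> impl Iterator<Item = &str> {
    s.split('/').filter(|segment| !segment.is_empty())
}

/// True if every segment of `prefix` matches the leading segments of `path`.
fn is_segment_prefix(prefix: &str, path: &str) -> bool {
    let mut remaining = path_segments(path);
    path_segments(prefix).all(|segment| remaining.next() == Some(segment))
}
```

Here `"/foobar/x".starts_with("/foo")` is true, but `is_segment_prefix("/foo", "/foobar/x")` is false, while `is_segment_prefix("/foo", "/foo/bar")` is true.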

tustvold · 2025-05-17T21:00:41Z

src/registry.rs

+    /// Lookup a store based on URL path
+    ///
+    /// Returns the store and its path segment depth
+    fn lookup(&self, to_resolve: &Url) -> Option<(&Arc<dyn ObjectStore>, usize)> {


This does mean that lookups will be rather inefficient if people register really long paths, since the path is walked one segment at a time. In most cases I think this should be acceptable.

criccomini

This looks excellent. I think it will work for our SlateDB use case!

criccomini · 2025-05-17T23:30:50Z

src/registry.rs

+    /// Resolve an object URL
+    ///
+    /// If [`ObjectStoreRegistry::register`] has been called with a URL with the same
+    /// scheme and host as the object URL, and a path that is a prefix of the object URL's


nit: scheme/host/port, right?

criccomini · 2025-05-17T23:31:17Z

src/registry.rs

+    ///
+    /// If [`ObjectStoreRegistry::register`] has been called with a URL with the same
+    /// scheme and host as the object URL, and a path that is a prefix of the object URL's
+    /// it should be returned with along with the trailing path. Paths should be matched


Suggested change

-    /// it should be returned with along with the trailing path. Paths should be matched
+    /// it should be returned along with the trailing path. Paths should be matched

criccomini · 2025-05-17T23:33:47Z

src/registry.rs

+    /// assert_eq!(path.as_ref(), "path/to/object");
+    /// assert!(Arc::ptr_eq(&ret, &bucket2));
+    ///
+    /// let bucket3 = Arc::new(PrefixStore::new(InMemory::new(), "path")) as Arc<dyn ObjectStore>;


criccomini · 2025-05-17T23:46:30Z

src/registry.rs

+        for segments in path_segments(url.path()) {
+            entry = entry.children.entry(segments.to_string()).or_default();


Suggested change

-        for segments in path_segments(url.path()) {
-            entry = entry.children.entry(segments.to_string()).or_default();
+        for segment in path_segments(url.path()) {
+            entry = entry.children.entry(segment.to_string()).or_default();

criccomini · 2025-05-17T23:48:37Z

src/registry.rs

+    children: HashMap<String, Self>,
+}
+
+impl PathEntry {


Nit: Some docs on how this works would be helpful. It's pretty slick, but it took me a few mins to wrap my head around.

I've added some docs

criccomini · 2025-05-17T23:50:18Z

src/registry.rs

+                if let Some(store) = &current.store {
+                    ret = Some((store, depth))
+                }
+            }


Should there be an else break here (or a while loop instead)? Once an entry doesn't have a match for the segment, I think we're done, right?

Consider the case of the following stores

memory:///

memory:///1/2/3

And resolving memory:///1/2/3/4, if we broke we would return the wrong answer.

I will add this as a test

To clarify, I'm not saying when we find a child with a store. I'm saying when we find that there's no child for the given segment. Doesn't this code iterate over path segments? So can't we break as soon as we find that a path entry has no child for the segment?

In the example you showed above, we would break since 4 isn't a child of 3 and we'd return the proper store.

Oh, yes we can 👍

(This actually was a bug, nice spot)
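For anyone following along, a simplified sketch of the segment trie and the lookup with the early break. Stores are replaced by `&'static str` labels so the example needs only the standard library; the field names follow the diff, but the method bodies are an approximation and not the merged code.

```rust
use std::collections::HashMap;

/// One node per path segment; `store` is set where a store was registered.
#[derive(Default)]
struct PathEntry {
    store: Option<&'static str>,
    children: HashMap<String, PathEntry>,
}

/// Non-empty segments of a path.
fn path_segments(s: &str) -> impl Iterator<Item = &str> {
    s.split('/').filter(|segment| !segment.is_empty())
}

impl PathEntry {
    /// Walks (creating as needed) one node per segment and stores the label at the end.
    fn register(&mut self, path: &str, store: &'static str) {
        let mut entry = self;
        for segment in path_segments(path) {
            entry = entry.children.entry(segment.to_string()).or_default();
        }
        entry.store = Some(store);
    }

    /// Returns the deepest registered store on the path and its segment depth.
    fn lookup(&self, path: &str) -> Option<(&'static str, usize)> {
        let mut current = self;
        let mut ret = current.store.map(|store| (store, 0));
        let mut depth = 0;
        for segment in path_segments(path) {
            match current.children.get(segment) {
                Some(child) => {
                    current = child;
                    depth += 1;
                    if let Some(store) = current.store {
                        ret = Some((store, depth));
                    }
                }
                // No child for this segment: later segments must not be matched
                // against the children of an earlier node, so stop here.
                None => break,
            }
        }
        ret
    }
}
```

With `/` and `/1/2/3` registered, resolving `/1/2/3/4` returns the store at depth 3, and `/1/x/2/3` falls back to the root store. Without the break, a walk that skipped the unmatched `x` could keep descending and presumably return the nested store by mistake.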

criccomini · 2025-05-17T23:51:34Z

src/registry.rs

+            }
+        }
+
+        if let Ok((store, path)) = parse_url_opts(to_resolve, std::env::vars()) {


criccomini · 2025-05-17T23:57:49Z

src/registry.rs

+        if let Ok((store, path)) = parse_url_opts(to_resolve, std::env::vars()) {
+            let depth = num_segments(to_resolve.path()) - num_segments(path.as_ref());
+
+            let mut map = self.map.write();


Is it possible to re-use the logic in register() here? Seems like duplicate logic.

Good point, I forgot to handle the race which is why this code is separate
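A sketch of the kind of race in question, assuming a cache guarded by `std::sync::RwLock` (the diff calls `self.map.write()` without unwrapping, so the real code may well use a different lock type). `Cache` and `get_or_create` are hypothetical names, and the stored `String` stands in for a store.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct Cache {
    map: RwLock<HashMap<String, Arc<String>>>,
}

impl Cache {
    /// Returns the cached value for `key`, creating it if absent.
    fn get_or_create(&self, key: &str, create: impl FnOnce() -> String) -> Arc<String> {
        {
            // Fast path: only a read lock is needed when the entry exists.
            let map = self.map.read().unwrap();
            if let Some(existing) = map.get(key) {
                return Arc::clone(existing);
            }
        }
        // Build the value without holding any lock.
        let created = Arc::new(create());
        // Another thread may have inserted in the meantime; `entry` keeps
        // whichever value got there first so every caller sees the same one.
        let mut map = self.map.write().unwrap();
        let winner = map.entry(key.to_string()).or_insert(created);
        Arc::clone(winner)
    }
}
```

The key point is the second check under the write lock: unconditionally inserting after the read-lock miss would let two concurrent callers end up holding different instances for the same URL.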

criccomini · 2025-05-17T23:59:56Z

src/registry.rs

+}
+
+/// Returns the non-empty segments of a path
+fn path_segments(s: &str) -> impl Iterator<Item = &str> {


Curious about the choice to implement this vs Url::path_segments

We just want the non-empty path segments (as Path strips them)
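A small sketch of the difference, using only the standard library. Splitting on `/` alone yields empty strings for leading, trailing, and doubled slashes, while the filtered version does not. The bodies below are a guess at the helpers' shape based on the signatures in the diff, not a copy of them.

```rust
/// Non-empty segments of a path: "/a//b/" yields "a" and "b".
fn path_segments(s: &str) -> impl Iterator<Item = &str> {
    s.split('/').filter(|segment| !segment.is_empty())
}

/// Number of non-empty segments in a path.
fn num_segments(s: &str) -> usize {
    path_segments(s).count()
}

/// Depth of the store prefix: segments in the full URL path minus the
/// segments left over in the object path. Assumes `object_path` is a
/// trailing part of `url_path`.
fn prefix_depth(url_path: &str, object_path: &str) -> usize {
    num_segments(url_path) - num_segments(object_path)
}
```

For example, `"/a//b/".split('/')` produces five items, three of them empty, while `path_segments("/a//b/")` produces just `a` and `b`. Likewise `prefix_depth("/bucket/path/to/object", "path/to/object")` is 1.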

tustvold · 2025-05-18T20:22:38Z

src/registry.rs

+
+/// Extracts the scheme and authority of a URL (components before the Path)
+fn url_key(url: &Url) -> &str {
+    &url[..url::Position::AfterPort]


It should be noted this will consider credentials unlike the version in DF - IMO this keeps things simple, and ultimately people shouldn't really be putting credentials in URLs

This just makes me nervous because what people should do and what they will do are two different things. I worry this will lead to leaked credentials and security incidents. Could we at least .warn() if we see it?

Ok, so imagine we do mask out passwords. Now imagine a user fat-fingers the password the first time and then tries again with the fixed password: because the URL has been cached with the incorrect password, it will continue to use the old version.

Whilst I accept passwords in URLs are bad, I also can't help feeling masking them introduces somewhat surprising behaviour for relatively limited security improvement - ultimately the user is typing a credential into a SQL query or similar, it's gonna get logged somewhere 😅

I'm suggesting leaving the code as you have it but calling a WARN telling the user they shouldn't be doing this.

IMO a utility crate like object_store is the wrong level for such a warning to be printed. We will get people complaining about it and then wanting to disable it, e.g. if the URLs being used are user-provided or something. I'd expect such a warning to be implemented as part of whatever frontend users are interacting with.

I think given this is a trait too people can potentially implement their own version if they have special needs
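To illustrate the trade-off discussed in this thread, a sketch with naive string slicing standing in for real URL parsing. This `url_key` is a simplified approximation of the function in the diff, and `masked_key` is a purely hypothetical alternative that strips the userinfo; neither is the actual implementation.

```rust
/// Everything before the path, credentials included (naive slicing).
fn url_key(url: &str) -> Option<&str> {
    let authority_start = url.find("://")? + "://".len();
    let end = url[authority_start..]
        .find('/')
        .map_or(url.len(), |offset| authority_start + offset);
    Some(&url[..end])
}

/// Hypothetical alternative: the same key with any `user:password@` removed.
fn masked_key(url: &str) -> Option<String> {
    let key = url_key(url)?;
    let (scheme, authority) = key.split_once("://")?;
    // Keep only the part after the last '@', if there is one.
    let host = authority.rsplit_once('@').map_or(authority, |(_, host)| host);
    Some(format!("{scheme}://{host}"))
}
```

With credentials in the key, `https://user:wrong@host/path` and `https://user:right@host/path` map to different entries, so a corrected password gets a fresh store. With the masked key both map to `https://host`, and the entry cached with the wrong password would keep being returned.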

criccomini

looks good!

alamb

This is looking good to me. I am happy to have @criccomini's review (made my review straightforward)

Thank you @tustvold and @criccomini

alamb · 2025-05-19T19:50:12Z

src/registry.rs

+    }
+}
+
+/// An [`ObjectStoreRegistry`] that uses [`parse_url_opts`] to create stores based on the environment


Can we please document what "based on the environment" means here? Ideally with an example

Basically point out that it will try and instantiate one of the built in stores and pass std::env::vars as options.

I also tried to improve the docs on parse_url_opts in a separate PR:

Improve parse_url_opts documentation #377

alamb · 2025-05-19T20:06:12Z

src/registry.rs

+
+/// Extracts the scheme and authority of a URL (components before the Path)
+fn url_key(url: &Url) -> &str {
+    &url[..url::Position::AfterPort]


I think given this is a trait too people can potentially implement their own version if they have special needs

tustvold commented May 17, 2025

Add ObjectStoreRegistry (apache#347)

7f64cba

tustvold force-pushed the add-object-store-registry branch from 6c7b5b7 to 7f64cba Compare May 17, 2025 18:48

tustvold commented May 17, 2025

tustvold marked this pull request as draft May 17, 2025 19:02

tustvold added 2 commits May 17, 2025 21:55

Make path segment based

0823ed1

Fix doc

f0c8842

This was referenced May 17, 2025

Add ObjectStoreRegistry #348

Closed

Add ObjectStoreUrl to resolve URLs to ObjectStore instances #356

Closed

tustvold marked this pull request as ready for review May 17, 2025 20:59

tustvold commented May 17, 2025

tustvold mentioned this pull request May 17, 2025

Add ObjectStoreUrl #362

Closed

criccomini reviewed May 18, 2025

tustvold added 3 commits May 18, 2025 20:52

Additional test

cc630ef

Handle race

3977a29

Review feedback

a348773

tustvold commented May 18, 2025

Fix prefix bug

aa9bec8

criccomini approved these changes May 18, 2025

alamb mentioned this pull request May 19, 2025

Improve parse_url_opts documentation #377

Merged

alamb approved these changes May 19, 2025

tustvold mentioned this pull request May 20, 2025

Make FileIO a Trait apache/iceberg-rust#1314

Open

Review feedback

a3110f5

tustvold merged commit 72088de into apache:main May 25, 2025
8 checks passed

weiji14 mentioned this pull request May 26, 2025

Extension tag registry developmentseed/async-tiff#100

Open
