Optimize the location overlap check with an index #1686

eric-maynard · 2025-05-27T23:34:16Z

The location overlap check for "sibling" tables (those which share a parent) has been a performance bottleneck since its introduction, but we haven't historically had a good way around this other than just disabling the check.

Current Behavior

The current logic is that when we create a table, we list all sibling tables and check each and every one to ensure there is no location overlap. This results in O(N^2) checks when adding N tables to a namespace, quickly becoming untenable.

With the CreateTreeDataset benchmark I tested creating 5000 sibling tables using the current code:

It is apparent that latency increases over time. Runs took between 90 and 200+ seconds, and Polaris instances with a small memory allocation were prone to crashing due to OOMs:

Proposed change

This PR adds a new persistence API, hasOverlappingSiblings, which if implemented can be used to directly check for the presence of siblings at the metastore layer.

This API is implemented for the JDBC metastore in a new schema version, and some changes are made to account for an evolving schema version now and in the future.

This implementation breaks a location down into components and queries for a sibling at each of those locations, so a new table at location s3://bucket/root/n1/nA/t1/ will require checking for an entity with location s3://bucket/, s3://bucket/root/, s3://bucket/root/n1/, s3://bucket/root/n1/nA/, and finally s3://bucket/root/n1/nA/t1/%. All of this can be done in a single query which makes a single pass over the data.

The query is optimized by the introduction of a new index over a new location column.

With the changes enabled, I tested creating 5000 sibling tables:

Latency is stable over time, and runs consistently completed in less than 30 seconds. I did not observe any OOMs when testing with the feature enabled.

…heck

singhpk234

I think this is a great improvement !

singhpk234 · 2025-05-29T17:43:17Z

...c/src/main/java/org/apache/polaris/extension/persistence/relational/jdbc/QueryGenerator.java

+  public static String generateOverlapQuery(String realmId, long parentId, String location) {
+    String[] components = location.split("/");
+    StringBuilder locationClauseBuilder = new StringBuilder();
+    StringBuilder pathBuilder = new StringBuilder();
+    for (String component : components) {
+      pathBuilder.append(component).append("/");
+      locationClauseBuilder.append(String.format(" OR location = '%s'", pathBuilder));
+    }
+    locationClauseBuilder.append(String.format(" OR location LIKE '%s%%'", location));
+    String query = "SELECT " + String.join(", ", new ModelEntity().toMap().keySet());
+
+    // TODO harden against realmId in this method and others
+    return query
+        + String.format(
+            " FROM POLARIS_SCHEMA.entities WHERE realm_id = '%s' AND parent_id = %d AND (1 = 2%s)",
+            realmId, parentId, locationClauseBuilder);
+  }


does this means we now don't need to load table from persistence and can just work directly without persistence.

We don't need to load the table if all we want to do is the sibling check, yeah. I select only the location for logging purposes

singhpk234 · 2025-05-29T17:50:23Z

...n/java/org/apache/polaris/extension/persistence/relational/jdbc/JdbcBasePersistenceImpl.java

+    if (this.version < 2) {
+      return Optional.empty();
+    }


we haven't release v1 yet, if we can get your pr in we might not require v2

Agreed just wanted to illustrate how we can use the version table

...c/src/main/java/org/apache/polaris/extension/persistence/relational/jdbc/QueryGenerator.java

...c/main/java/org/apache/polaris/extension/persistence/relational/jdbc/models/ModelEntity.java

extension/persistence/relational-jdbc/src/main/resources/h2/schema-v2.sql

adnanhemani · 2025-05-29T19:12:17Z

...c/test/java/org/apache/polaris/extension/persistence/relational/jdbc/QueryGeneratorTest.java

@@ -45,7 +45,7 @@ void testGenerateSelectQuery_withMapWhereClause() {
    whereClause.put("name", "testEntity");
    whereClause.put("entity_version", 1);
    String expectedQuery =
-        "SELECT entity_version, to_purge_timestamp, internal_properties, catalog_id, purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, name, id, drop_timestamp, properties, grant_records_version, type_code FROM POLARIS_SCHEMA.ENTITIES WHERE entity_version = 1 AND name = 'testEntity'";
+        "SELECT entity_version, to_purge_timestamp, internal_properties, catalog_id, purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, name, location, id, drop_timestamp, properties, grant_records_version, type_code FROM POLARIS_SCHEMA.ENTITIES WHERE entity_version = 1 AND name = 'testEntity'";


This may be irrelevant to the current PR - but why did location get added in this position amongst the columns? I'm assuming that this behavior is at least idempotent - but in ModelEntity it is at the last position, but when the query generated here, it is not.

If I read the code correctly, the ordering comes the way it is due to the keySet() iterator on the HashMap we create in the toMap() function. But key ordering is not guaranteed in HashMaps...so I'm not sure how these columns are idempotently ordered so far?

Any insights?

Update after talking to @singhpk234: I think this idempotent due to Java having the same amount of buckets for a HashMap when started and us adding the exact same keys to the map. But in the future, we should change over to a TreeMap or something similar, which will guarantee the ordering for us.

service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java

adnanhemani · 2025-05-29T20:28:33Z

Overall, I think this could be useful! The only thing I could think of is: if the location happens to be super long - is there a limit for how long the query string we send in JDBC can be? Other than that, I don't see additional load on the persistence layer - and as a result, this should be a good optimization to make!

getting-started/jdbc/docker-compose.yml

adnanhemani

LGTM - nice optimization!

dimas-b · 2025-06-02T17:48:12Z

...dbc/src/main/java/org/apache/polaris/extension/persistence/relational/jdbc/DatabaseType.java

@@ -44,6 +44,6 @@ public static DatabaseType fromDisplayName(String displayName) {
  }

  public String getInitScriptResource() {
-    return String.format("%s/schema-v1.sql", this.getDisplayName());
+    return String.format("%s/schema-v2.sql", this.getDisplayName());


Do we need a user-visible notice for this? I believe JDBC Persistence is about to be released in 0.10.0.

Schema changes (and upgrade implications) would be nice to discuss on the dev ML for awareness.

0.10 is an "experimental" release, and even if it wasn't I'm not sure what guarantees we make about backwards compatibility of persistence from one version to the next. Further, it seems like it will "just work" with old versions as they just won't select the new column. Is there any user-visible change?

Beyond this PR, we should probably define that and create a doc that maps between these schema files and releases.

dimas-b

This optimization looks reasonable to me overall. However, I believe that implementing it needs more attention to:

schema compatibility
Persistence interfaces
Explicit rules for what constitutes a "prefix"

extension/persistence/relational-jdbc/src/main/resources/h2/schema-v2.sql

getting-started/jdbc/docker-compose.yml

dimas-b · 2025-06-02T17:56:16Z

polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java

+          .defaultValue(true)
+          .buildFeatureConfiguration();
+
+  public static final FeatureConfiguration<Boolean> OPTIMIZED_SIBLING_CHECK =


Does this feature rely on ADD_TRAILING_SLASH_TO_LOCATION for correct operation?

It's a bit more nuanced than that; the previous locations should have been written with ADD_TRAILING_SLASH_TO_LOCATION enabled (or just with trailing slashes on them) for the optimized check to be correct. Added this to the description.

dimas-b · 2025-06-02T17:57:54Z

...dbc/src/main/java/org/apache/polaris/extension/persistence/relational/jdbc/DatabaseType.java

@@ -44,6 +44,6 @@ public static DatabaseType fromDisplayName(String displayName) {
  }

  public String getInitScriptResource() {
-    return String.format("%s/schema-v1.sql", this.getDisplayName());
+    return String.format("%s/schema-v2.sql", this.getDisplayName());


Schema changes (and upgrade implications) would be nice to discuss on the dev ML for awareness.

dimas-b · 2025-06-02T17:59:55Z

...n/java/org/apache/polaris/extension/persistence/relational/jdbc/JdbcBasePersistenceImpl.java

+    String query = generateVersionQuery();
+    try {
+      List<SchemaVersion> schemaVersion =
+          datasourceOperations.executeSelect(query, new SchemaVersion());


Using "capability" flags instead of (simple) versions would be preferable from my POV.

In this case it looks like we need to check specifically for the presence of the new index, other schema changes (if they existed) would be irrelevant.

Because the schema files are simply versioned, I think it makes sense to record the schema file which was used to create the tables in the metastore itself. Otherwise, you'd have to do a complex mapping back from capabilities back to a schema-v? file. That said, as we progress with adding capabilities, I think it make sense to write down somewhere which capabilities become available in which version.

dimas-b · 2025-06-02T18:01:50Z

...n/java/org/apache/polaris/extension/persistence/relational/jdbc/JdbcBasePersistenceImpl.java

+    String query = QueryGenerator.generateOverlapQuery(realmId, parentId, location);
+    try {
+      var results = datasourceOperations.executeSelect(query, new ModelEntity());
+      if (results.isEmpty()) {


What if the version is 2 but the location data is not present in tables?

Then the check will pass; if you have a persistence in a weird state like this, you should not enable OPTIMIZED_SIBLING_CHECK. It's disabled by default.

dimas-b · 2025-06-02T18:11:54Z

...c/test/java/org/apache/polaris/extension/persistence/relational/jdbc/QueryGeneratorTest.java

+            + "purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, "
+            + "name, location, id, drop_timestamp, properties, grant_records_version, type_code "
+            + "FROM POLARIS_SCHEMA.ENTITIES WHERE realm_id = 'realmId' AND parent_id = -398224152 "
+            + "AND (1 = 2 OR location = 's3:/' OR location = 's3://' OR location = 's3://bucket/' OR "


I believe the s3:/ and s3:// cases are conceptually incorrect.

Can you explain that belief more? I think that if you really have a entity at location s3: or s3:/ then something has gone very wrong. New tables at s3 locations can fail to create.

s3:/ is not a valid storage URI by any means. It is specifically not a valid prefix for a storage location. What is the rationale trying to find it in the location properties of other entries?

Granted, Polaris should not allow storing s3:/ as a location property. However, if that happens for any reason, the current impl. will block any other S3 catalog creation. In effect, the current optimization implementation has a tendency to exacerbate errors (as opposed to dampening them, or at least being neutral).

In effect, the current optimization implementation has a tendency to exacerbate errors

The error is real; the path example://path does overlap with example:/.

...c/test/java/org/apache/polaris/extension/persistence/relational/jdbc/QueryGeneratorTest.java

dimas-b · 2025-06-02T18:13:59Z

...c/test/java/org/apache/polaris/extension/persistence/relational/jdbc/QueryGeneratorTest.java

+            + "name, location, id, drop_timestamp, properties, grant_records_version, type_code "
+            + "FROM POLARIS_SCHEMA.ENTITIES WHERE realm_id = 'realmId' AND parent_id = -398224152 "
+            + "AND (1 = 2 OR location = 's3:/' OR location = 's3://' OR location = 's3://bucket/' OR "
+            + "location = 's3://bucket/tmp/' OR location = 's3://bucket/tmp/location/' OR location "


What about case differences in the URI scheme or bucket name?

What is a particular storage technology treats path in a case-insensitive way?

What about special chars in the URI?

What about case differences in the URI scheme or bucket name?

Then there is no overlap.

What is a particular storage technology treats path in a case-insensitive way?

The check is case sensitive, to be clear. From what I can see, the current check is also case sensitive (1, 2).

Do you mean what if a storage technology does that? Then the current logic in Polaris won't work.

What about special chars in the URI?

What about them?

Re: special chars, I believe this PR needs specific tests to ensure they are handled correctly.

Do you mean what if a storage technology does that?

If we later support a storage technology with case-insensitive file paths, the existing implementation, which is grounded in a database-level string comparison feature, will have a hard time supporting it (initially it will not work correctly, as far as I can tell).

Conversely, if the admin user for some reason (accidentally?) configures the database to perform case-insensitive comparisons, it will break Polaris Management API contracts (behaviour) for the current set of storage technologies.

All in all, current approach appear to introduce too many obscure dependencies between API-level behaviours and backend behaviours (IMHO).

I support the idea of leveraging database capabilities to facilitate finding overlapping catalogs. However, I think it should be done in a way and any oddities at the database level do not affect the correctness of user-visible Polaris behaviour. In other words, if something is not aligned at the database level, it would be acceptable for Polaris performance to degrade as long as correctness is not affected.

In that regard, the presence of any entity with a location, but without an indexed location, should probably invalidate the optimized lookup result. If this requires admin user intervention, we have to make it explicit.

Also, the parsing of the URI into path components (prefixes) should probably be generalized outside the Persistence layer (for reuse by all Persistence implementations).

If we later support a storage technology with case-insensitive file paths, the existing implementation ... will have a hard time supporting it

We will have to return Optional.empty() for the hasOverlappingSiblings method when that storage type is in use. So we'll have an easy time supporting this hypothetical storage technology, just not using the optimized check. However, the 4 storage technologies we currently actually support do work with the optimized check.

Conversely, if the admin user for some reason (accidentally?) configures the database to perform case-insensitive comparisons

If this is a real configuration, then tons of things in our existing persistence implementation will break very severely, such as lookup by name. We rely on the database being case-sensitive already, so this PR isn't introducing any new dependency there.

Also, the parsing of the URI into path components (prefixes) should probably be generalized outside the Persistence layer

Only one persistence implementation currently supports this check, so encapsulating the logic there seems wise.

In that regard, the presence of any entity with a location, but without an indexed location, should probably invalidate the optimized lookup result.

This is not a state you can get into to have in the current implemenation.

dimas-b · 2025-06-02T18:20:37Z

polaris-core/src/main/java/org/apache/polaris/core/persistence/BasePersistence.java

+   * Check if the specified IcebergTableLikeEntity has any same-namespace siblings which share a
+   * location
+   *
+   * @param callContext the polaris call context
+   * @param parentId the parent entity to look for duplicates inside
+   * @param location the location to check for overlaps against
+   * @return Optional.of(Optional.of(location)) if the parent entity has children,
+   *     Optional.of(Optional.empty()) if not, and Optional.empty() if the metastore doesn't support
+   *     this operation
+   */
+  Optional<Optional<String>> hasOverlappingSiblings(
+      @Nonnull PolarisCallContext callContext, long parentId, String location);


This method implies that all Persistence implementations has a common understanding of what location means in entities. However, ATM, the location is just an opaque entity property (part of a map of generic properties).

I think, we need to, at a minimum, to add javadoc about how this location is defined. Ideally, since it is used as a first interface parameter, it should be exposes as a first class accessor method in entities.

This method implies that all Persistence implementations has a common understanding of what location means in entities.

I don't think that's exactly true. There's nothing here that implies commonality across persistence implementations. One persistence implementation could have location as the base location, while another persistence implementation could have the base location flipped backwards. Both implementations could work.

Ideally, since it is used as a first interface parameter, it should be exposes as a first class accessor method in entities.

The entities that have a location do expose a method to get the location -- getBaseLocation. There's no interface for this method. I would love to see this refactored, and then this method could take a EntityWithLocation or whatever interface we choose to have the location-having entities implement. But since we don't use those types consistently now, I think that change would be quite difficult to make. It also has implications on the persistence implementations themselves and on how we treat the various entity objects, and previous discussions about that (a refactor oriented around DAOs) were highly contentious.

Sorry, I missed out on getDefaultBaseLocation() in CatalogEntity.

However, I believe the new API method should be clear about the relationship of CatalogEntity.getDefaultBaseLocation() and the location method parameter.

Also, re: getDefaultBaseLocation() ... since it's "default", is there a possibility it's overridden? Is there a possibility a catalog has other, non-default locations that need to be checked for overlaps?

On your last note, the intent here is to optimize the current logic. The current logic looks at getBaseLocation.

I've introduced a new type to capture all the entities that have a location and refactored everything to use that type (mixed into PolarisEntity). Please let me know what you think.

dimas-b · 2025-06-03T17:17:31Z

polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java

-                  + " supported by the JDBC metastore.")
+              "When set, an index is used to perform the sibling check between tables, views, and namespaces. New " +
+                  "locations will be checked against previous ones based on components, so the new location " +
+                  "/foo/bar/ will check for a sibling at /, /foo/ and /foo/bar/%. In order for this check to " +


nit: It might be preferable to avoid using plain file locations in example since the FILE storage type is disabled by default.

…heck

snazy · 2025-06-10T12:10:00Z

service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java

+    String name = entity.getName();
+
+    // Attempt to directly query for siblings
+    boolean useOptimizedSiblingCheck =


Isn't this something that the persistence implementation should decide?

eric-maynard added 17 commits May 23, 2025 11:00

initial commit

ba0ec2b

add index

6ee1262

changes

5426f16

needs refactoring

2f8ad59

wire model

7172b05

continue wiring

8a6f567

feature configs

f8f615c

building

911b37b

testing

b731fe1

failed to find version

ef8e024

stable

79b8ec2

slash fix

70a3481

stable with slash

6529a84

testing

ac0b17e

Merge branch 'main' of github.com:apache/polaris into index-sibling-c…

d8a3140

…heck

sql fix

30412b5

review

7e69c97

eric-maynard requested review from adutra, ashvina, dennishuo, dimas-b, jackye1995, jbonofre, vvcephei, collado-mike, snazy, RussellSpitzer, takidau, MonkeyCanCode and flyrain as code owners May 27, 2025 23:34

eric-maynard requested review from singhpk234 and pingtimeout as code owners May 27, 2025 23:34

github-project-automation bot added this to Basic Kanban Board May 27, 2025

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board May 27, 2025

eric-maynard added 5 commits May 27, 2025 16:36

polish

c88724a

QG unit test

dc1b6cb

autolint

98ba46e

bugfix

0990483

autolint

5650be2

singhpk234 reviewed May 29, 2025

View reviewed changes

adnanhemani reviewed May 29, 2025

View reviewed changes

getting-started/jdbc/docker-compose.yml Outdated Show resolved Hide resolved

eric-maynard added 2 commits May 30, 2025 09:57

changes per review

f11205d

testfix

c9e2d5c

adnanhemani approved these changes May 31, 2025

View reviewed changes

dimas-b reviewed Jun 2, 2025

View reviewed changes

eric-maynard added 2 commits June 2, 2025 21:53

some changes per review

8605a1c

autolint

aa1591c

dimas-b reviewed Jun 3, 2025

View reviewed changes

eric-maynard added 8 commits June 3, 2025 14:03

pull main

8b5e71e

some special characters

4b49d01

some special characters

b1d6425

refactor in progress

e4c8c2c

Semistable

ecda6e2

stable after refactor

a976bb8

Merge branch 'main' of github.com:apache/polaris into index-sibling-c…

12d8530

…heck

remove 1 = 2

1005a23

snazy reviewed Jun 10, 2025

View reviewed changes

Optimize the location overlap check with an index #1686

Are you sure you want to change the base?

Optimize the location overlap check with an index #1686

Uh oh!

Conversation

eric-maynard commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current Behavior

Proposed change

Uh oh!

singhpk234 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adnanhemani commented May 29, 2025

Uh oh!

Uh oh!

adnanhemani left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-maynard Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimas-b left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-maynard Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-maynard Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-maynard Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

eric-maynard commented May 27, 2025 •

edited

Loading

eric-maynard Jun 3, 2025 •

edited

Loading

eric-maynard Jun 3, 2025 •

edited

Loading

eric-maynard Jun 3, 2025 •

edited

Loading

eric-maynard Jun 3, 2025 •

edited

Loading

eric-maynard Jun 3, 2025 •

edited

Loading