Skip to content

Optimize the location overlap check with an index #1686

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

eric-maynard
Copy link
Contributor

@eric-maynard eric-maynard commented May 27, 2025

The location overlap check for "sibling" tables (those which share a parent) has been a performance bottleneck since its introduction, but we haven't historically had a good way around this other than just disabling the check.


Current Behavior

The current logic is that when we create a table, we list all sibling tables and check each and every one to ensure there is no location overlap. This results in O(N^2) checks when adding N tables to a namespace, quickly becoming untenable.

With the CreateTreeDataset benchmark I tested creating 5000 sibling tables using the current code:

Screenshot 2025-05-27 at 4 26 56 PM

It is apparent that latency increases over time. Runs took between 90 and 200+ seconds, and Polaris instances with a small memory allocation were prone to crashing due to OOMs:

Screenshot 2025-05-27 at 4 33 57 PM

Proposed change

This PR adds a new persistence API, hasOverlappingSiblings, which if implemented can be used to directly check for the presence of siblings at the metastore layer.

This API is implemented for the JDBC metastore in a new schema version, and some changes are made to account for an evolving schema version now and in the future.

This implementation breaks a location down into components and queries for a sibling at each of those locations, so a new table at location s3://bucket/root/n1/nA/t1/ will require checking for an entity with location s3://bucket/, s3://bucket/root/, s3://bucket/root/n1/, s3://bucket/root/n1/nA/, and finally s3://bucket/root/n1/nA/t1/%. All of this can be done in a single query which makes a single pass over the data.

The query is optimized by the introduction of a new index over a new location column.

With the changes enabled, I tested creating 5000 sibling tables:

Screenshot 2025-05-27 at 4 32 12 PM

Latency is stable over time, and runs consistently completed in less than 30 seconds. I did not observe any OOMs when testing with the feature enabled.

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great improvement !

Comment on lines 219 to 235
public static String generateOverlapQuery(String realmId, long parentId, String location) {
String[] components = location.split("/");
StringBuilder locationClauseBuilder = new StringBuilder();
StringBuilder pathBuilder = new StringBuilder();
for (String component : components) {
pathBuilder.append(component).append("/");
locationClauseBuilder.append(String.format(" OR location = '%s'", pathBuilder));
}
locationClauseBuilder.append(String.format(" OR location LIKE '%s%%'", location));
String query = "SELECT " + String.join(", ", new ModelEntity().toMap().keySet());

// TODO harden against realmId in this method and others
return query
+ String.format(
" FROM POLARIS_SCHEMA.entities WHERE realm_id = '%s' AND parent_id = %d AND (1 = 2%s)",
realmId, parentId, locationClauseBuilder);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this means we now don't need to load table from persistence and can just work directly without persistence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to load the table if all we want to do is the sibling check, yeah. I select only the location for logging purposes

Comment on lines 582 to 584
if (this.version < 2) {
return Optional.empty();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we haven't release v1 yet, if we can get your pr in we might not require v2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed just wanted to illustrate how we can use the version table

@@ -45,7 +45,7 @@ void testGenerateSelectQuery_withMapWhereClause() {
whereClause.put("name", "testEntity");
whereClause.put("entity_version", 1);
String expectedQuery =
"SELECT entity_version, to_purge_timestamp, internal_properties, catalog_id, purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, name, id, drop_timestamp, properties, grant_records_version, type_code FROM POLARIS_SCHEMA.ENTITIES WHERE entity_version = 1 AND name = 'testEntity'";
"SELECT entity_version, to_purge_timestamp, internal_properties, catalog_id, purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, name, location, id, drop_timestamp, properties, grant_records_version, type_code FROM POLARIS_SCHEMA.ENTITIES WHERE entity_version = 1 AND name = 'testEntity'";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be irrelevant to the current PR - but why did location get added in this position amongst the columns? I'm assuming that this behavior is at least idempotent - but in ModelEntity it is at the last position, but when the query generated here, it is not.

If I read the code correctly, the ordering comes the way it is due to the keySet() iterator on the HashMap we create in the toMap() function. But key ordering is not guaranteed in HashMaps...so I'm not sure how these columns are idempotently ordered so far?

Any insights?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update after talking to @singhpk234: I think this idempotent due to Java having the same amount of buckets for a HashMap when started and us adding the exact same keys to the map. But in the future, we should change over to a TreeMap or something similar, which will guarantee the ordering for us.

@adnanhemani
Copy link
Collaborator

Overall, I think this could be useful! The only thing I could think of is: if the location happens to be super long - is there a limit for how long the query string we send in JDBC can be? Other than that, I don't see additional load on the persistence layer - and as a result, this should be a good optimization to make!

Copy link
Collaborator

@adnanhemani adnanhemani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - nice optimization!

@@ -44,6 +44,6 @@ public static DatabaseType fromDisplayName(String displayName) {
}

public String getInitScriptResource() {
return String.format("%s/schema-v1.sql", this.getDisplayName());
return String.format("%s/schema-v2.sql", this.getDisplayName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a user-visible notice for this? I believe JDBC Persistence is about to be released in 0.10.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema changes (and upgrade implications) would be nice to discuss on the dev ML for awareness.

Copy link
Contributor Author

@eric-maynard eric-maynard Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.10 is an "experimental" release, and even if it wasn't I'm not sure what guarantees we make about backwards compatibility of persistence from one version to the next. Further, it seems like it will "just work" with old versions as they just won't select the new column. Is there any user-visible change?

Beyond this PR, we should probably define that and create a doc that maps between these schema files and releases.

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization looks reasonable to me overall. However, I believe that implementing it needs more attention to:

  • schema compatibility
  • Persistence interfaces
  • Explicit rules for what constitutes a "prefix"

.defaultValue(true)
.buildFeatureConfiguration();

public static final FeatureConfiguration<Boolean> OPTIMIZED_SIBLING_CHECK =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this feature rely on ADD_TRAILING_SLASH_TO_LOCATION for correct operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit more nuanced than that; the previous locations should have been written with ADD_TRAILING_SLASH_TO_LOCATION enabled (or just with trailing slashes on them) for the optimized check to be correct. Added this to the description.

@@ -44,6 +44,6 @@ public static DatabaseType fromDisplayName(String displayName) {
}

public String getInitScriptResource() {
return String.format("%s/schema-v1.sql", this.getDisplayName());
return String.format("%s/schema-v2.sql", this.getDisplayName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema changes (and upgrade implications) would be nice to discuss on the dev ML for awareness.

String query = generateVersionQuery();
try {
List<SchemaVersion> schemaVersion =
datasourceOperations.executeSelect(query, new SchemaVersion());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using "capability" flags instead of (simple) versions would be preferable from my POV.

In this case it looks like we need to check specifically for the presence of the new index, other schema changes (if they existed) would be irrelevant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the schema files are simply versioned, I think it makes sense to record the schema file which was used to create the tables in the metastore itself. Otherwise, you'd have to do a complex mapping back from capabilities back to a schema-v? file. That said, as we progress with adding capabilities, I think it make sense to write down somewhere which capabilities become available in which version.

String query = QueryGenerator.generateOverlapQuery(realmId, parentId, location);
try {
var results = datasourceOperations.executeSelect(query, new ModelEntity());
if (results.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the version is 2 but the location data is not present in tables?

Copy link
Contributor Author

@eric-maynard eric-maynard Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the check will pass; if you have a persistence in a weird state like this, you should not enable OPTIMIZED_SIBLING_CHECK. It's disabled by default.

+ "purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, "
+ "name, location, id, drop_timestamp, properties, grant_records_version, type_code "
+ "FROM POLARIS_SCHEMA.ENTITIES WHERE realm_id = 'realmId' AND parent_id = -398224152 "
+ "AND (1 = 2 OR location = 's3:/' OR location = 's3://' OR location = 's3://bucket/' OR "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the s3:/ and s3:// cases are conceptually incorrect.

Copy link
Contributor Author

@eric-maynard eric-maynard Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain that belief more? I think that if you really have a entity at location s3: or s3:/ then something has gone very wrong. New tables at s3 locations can fail to create.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3:/ is not a valid storage URI by any means. It is specifically not a valid prefix for a storage location. What is the rationale trying to find it in the location properties of other entries?

Granted, Polaris should not allow storing s3:/ as a location property. However, if that happens for any reason, the current impl. will block any other S3 catalog creation. In effect, the current optimization implementation has a tendency to exacerbate errors (as opposed to dampening them, or at least being neutral).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In effect, the current optimization implementation has a tendency to exacerbate errors

The error is real; the path example://path does overlap with example:/.

+ "name, location, id, drop_timestamp, properties, grant_records_version, type_code "
+ "FROM POLARIS_SCHEMA.ENTITIES WHERE realm_id = 'realmId' AND parent_id = -398224152 "
+ "AND (1 = 2 OR location = 's3:/' OR location = 's3://' OR location = 's3://bucket/' OR "
+ "location = 's3://bucket/tmp/' OR location = 's3://bucket/tmp/location/' OR location "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about case differences in the URI scheme or bucket name?

What is a particular storage technology treats path in a case-insensitive way?

What about special chars in the URI?

Copy link
Contributor Author

@eric-maynard eric-maynard Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about case differences in the URI scheme or bucket name?

Then there is no overlap.

What is a particular storage technology treats path in a case-insensitive way?

The check is case sensitive, to be clear. From what I can see, the current check is also case sensitive (1, 2).

Do you mean what if a storage technology does that? Then the current logic in Polaris won't work.

What about special chars in the URI?

What about them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: special chars, I believe this PR needs specific tests to ensure they are handled correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean what if a storage technology does that?

If we later support a storage technology with case-insensitive file paths, the existing implementation, which is grounded in a database-level string comparison feature, will have a hard time supporting it (initially it will not work correctly, as far as I can tell).

Conversely, if the admin user for some reason (accidentally?) configures the database to perform case-insensitive comparisons, it will break Polaris Management API contracts (behaviour) for the current set of storage technologies.

All in all, current approach appear to introduce too many obscure dependencies between API-level behaviours and backend behaviours (IMHO).

I support the idea of leveraging database capabilities to facilitate finding overlapping catalogs. However, I think it should be done in a way and any oddities at the database level do not affect the correctness of user-visible Polaris behaviour. In other words, if something is not aligned at the database level, it would be acceptable for Polaris performance to degrade as long as correctness is not affected.

In that regard, the presence of any entity with a location, but without an indexed location, should probably invalidate the optimized lookup result. If this requires admin user intervention, we have to make it explicit.

Also, the parsing of the URI into path components (prefixes) should probably be generalized outside the Persistence layer (for reuse by all Persistence implementations).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we later support a storage technology with case-insensitive file paths, the existing implementation ... will have a hard time supporting it

We will have to return Optional.empty() for the hasOverlappingSiblings method when that storage type is in use. So we'll have an easy time supporting this hypothetical storage technology, just not using the optimized check. However, the 4 storage technologies we currently actually support do work with the optimized check.

Conversely, if the admin user for some reason (accidentally?) configures the database to perform case-insensitive comparisons

If this is a real configuration, then tons of things in our existing persistence implementation will break very severely, such as lookup by name. We rely on the database being case-sensitive already, so this PR isn't introducing any new dependency there.

Also, the parsing of the URI into path components (prefixes) should probably be generalized outside the Persistence layer

Only one persistence implementation currently supports this check, so encapsulating the logic there seems wise.

In that regard, the presence of any entity with a location, but without an indexed location, should probably invalidate the optimized lookup result.

This is not a state you can get into to have in the current implemenation.

Comment on lines 408 to 419
* Check if the specified IcebergTableLikeEntity has any same-namespace siblings which share a
* location
*
* @param callContext the polaris call context
* @param parentId the parent entity to look for duplicates inside
* @param location the location to check for overlaps against
* @return Optional.of(Optional.of(location)) if the parent entity has children,
* Optional.of(Optional.empty()) if not, and Optional.empty() if the metastore doesn't support
* this operation
*/
Optional<Optional<String>> hasOverlappingSiblings(
@Nonnull PolarisCallContext callContext, long parentId, String location);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method implies that all Persistence implementations has a common understanding of what location means in entities. However, ATM, the location is just an opaque entity property (part of a map of generic properties).

I think, we need to, at a minimum, to add javadoc about how this location is defined. Ideally, since it is used as a first interface parameter, it should be exposes as a first class accessor method in entities.

Copy link
Contributor Author

@eric-maynard eric-maynard Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method implies that all Persistence implementations has a common understanding of what location means in entities.

I don't think that's exactly true. There's nothing here that implies commonality across persistence implementations. One persistence implementation could have location as the base location, while another persistence implementation could have the base location flipped backwards. Both implementations could work.

Ideally, since it is used as a first interface parameter, it should be exposes as a first class accessor method in entities.

The entities that have a location do expose a method to get the location -- getBaseLocation. There's no interface for this method. I would love to see this refactored, and then this method could take a EntityWithLocation or whatever interface we choose to have the location-having entities implement. But since we don't use those types consistently now, I think that change would be quite difficult to make. It also has implications on the persistence implementations themselves and on how we treat the various entity objects, and previous discussions about that (a refactor oriented around DAOs) were highly contentious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed out on getDefaultBaseLocation() in CatalogEntity.

However, I believe the new API method should be clear about the relationship of CatalogEntity.getDefaultBaseLocation() and the location method parameter.

Also, re: getDefaultBaseLocation() ... since it's "default", is there a possibility it's overridden? Is there a possibility a catalog has other, non-default locations that need to be checked for overlaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On your last note, the intent here is to optimize the current logic. The current logic looks at getBaseLocation.

I've introduced a new type to capture all the entities that have a location and refactored everything to use that type (mixed into PolarisEntity). Please let me know what you think.

+ " supported by the JDBC metastore.")
"When set, an index is used to perform the sibling check between tables, views, and namespaces. New " +
"locations will be checked against previous ones based on components, so the new location " +
"/foo/bar/ will check for a sibling at /, /foo/ and /foo/bar/%. In order for this check to " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It might be preferable to avoid using plain file locations in example since the FILE storage type is disabled by default.

String name = entity.getName();

// Attempt to directly query for siblings
boolean useOptimizedSiblingCheck =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this something that the persistence implementation should decide?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants