Optimize the location overlap check with an index #1686
Conversation
I think this is a great improvement!
public static String generateOverlapQuery(String realmId, long parentId, String location) {
  String[] components = location.split("/");
  StringBuilder locationClauseBuilder = new StringBuilder();
  StringBuilder pathBuilder = new StringBuilder();
  for (String component : components) {
    pathBuilder.append(component).append("/");
    locationClauseBuilder.append(String.format(" OR location = '%s'", pathBuilder));
  }
  locationClauseBuilder.append(String.format(" OR location LIKE '%s%%'", location));
  String query = "SELECT " + String.join(", ", new ModelEntity().toMap().keySet());

  // TODO harden against realmId in this method and others
  return query
      + String.format(
          " FROM POLARIS_SCHEMA.entities WHERE realm_id = '%s' AND parent_id = %d AND (1 = 2%s)",
          realmId, parentId, locationClauseBuilder);
}
Does this mean we now don't need to load the table from persistence and can just work directly without it?
We don't need to load the table if all we want to do is the sibling check, yeah. I select only the location for logging purposes.
if (this.version < 2) {
  return Optional.empty();
}
We haven't released v1 yet; if we can get your PR in, we might not require v2.
Agreed, I just wanted to illustrate how we can use the version table.
@@ -45,7 +45,7 @@ void testGenerateSelectQuery_withMapWhereClause() {
     whereClause.put("name", "testEntity");
     whereClause.put("entity_version", 1);
     String expectedQuery =
-        "SELECT entity_version, to_purge_timestamp, internal_properties, catalog_id, purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, name, id, drop_timestamp, properties, grant_records_version, type_code FROM POLARIS_SCHEMA.ENTITIES WHERE entity_version = 1 AND name = 'testEntity'";
+        "SELECT entity_version, to_purge_timestamp, internal_properties, catalog_id, purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, name, location, id, drop_timestamp, properties, grant_records_version, type_code FROM POLARIS_SCHEMA.ENTITIES WHERE entity_version = 1 AND name = 'testEntity'";
This may be irrelevant to the current PR, but why did `location` get added in this position amongst the columns? I'm assuming that this behavior is at least idempotent, but in `ModelEntity` it is at the last position, while in the query generated here it is not.
If I read the code correctly, the ordering comes out the way it does due to the `keySet()` iterator on the `HashMap` we create in the `toMap()` function. But key ordering is not guaranteed in `HashMap`s... so I'm not sure how these columns are idempotently ordered so far?
Any insights?
Update after talking to @singhpk234: I think this is idempotent because Java starts the `HashMap` with the same number of buckets and we add the exact same keys to the map. But in the future, we should change over to a `TreeMap` or something similar, which will guarantee the ordering for us.
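As a rough illustration of that suggestion (a minimal sketch, not the actual `ModelEntity` code; the columns shown here are made up), a `TreeMap` makes the column order produced by `keySet()` deterministic:

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnOrderingSketch {
  // Hypothetical stand-in for ModelEntity.toMap(): a TreeMap keeps keys sorted,
  // so keySet() iteration order no longer depends on HashMap bucket internals.
  static Map<String, Object> toMap() {
    Map<String, Object> columns = new TreeMap<>();
    columns.put("id", 42L);
    columns.put("name", "testEntity");
    columns.put("location", "s3://bucket/path/");
    return columns;
  }

  public static void main(String[] args) {
    // Always prints: id, location, name
    System.out.println(String.join(", ", toMap().keySet()));
  }
}
```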
Overall, I think this could be useful! The only thing I could think of is: if the location happens to be super long - is there a limit for how long the query string we send in JDBC can be? Other than that, I don't see additional load on the persistence layer - and as a result, this should be a good optimization to make!
LGTM - nice optimization!
@@ -44,6 +44,6 @@ public static DatabaseType fromDisplayName(String displayName) {
   }

   public String getInitScriptResource() {
-    return String.format("%s/schema-v1.sql", this.getDisplayName());
+    return String.format("%s/schema-v2.sql", this.getDisplayName());
Do we need a user-visible notice for this? I believe JDBC Persistence is about to be released in 0.10.0.
Schema changes (and upgrade implications) would be nice to discuss on the dev ML for awareness.
0.10 is an "experimental" release, and even if it wasn't I'm not sure what guarantees we make about backwards compatibility of persistence from one version to the next. Further, it seems like it will "just work" with old versions as they just won't select the new column. Is there any user-visible change?
Beyond this PR, we should probably define that and create a doc that maps between these schema files and releases.
This optimization looks reasonable to me overall. However, I believe that implementing it needs more attention to:
- schema compatibility
- Persistence interfaces
- Explicit rules for what constitutes a "prefix"
          .defaultValue(true)
          .buildFeatureConfiguration();

  public static final FeatureConfiguration<Boolean> OPTIMIZED_SIBLING_CHECK =
Does this feature rely on `ADD_TRAILING_SLASH_TO_LOCATION` for correct operation?
It's a bit more nuanced than that; the previous locations should have been written with `ADD_TRAILING_SLASH_TO_LOCATION` enabled (or just with trailing slashes on them) for the optimized check to be correct. Added this to the description.
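A small illustration of that nuance (the locations here are made up, not from the PR): without trailing slashes on stored locations, a prefix-style comparison can report a false overlap between disjoint siblings that merely share a string prefix.

```java
public class TrailingSlashSketch {
  public static void main(String[] args) {
    // Hypothetical example of why stored locations need trailing slashes.
    String existing = "s3://bucket/warehouse";     // stored without a trailing slash
    String candidate = "s3://bucket/warehouse2/";  // a genuinely disjoint location

    // A prefix check against the un-slashed value reports a false overlap...
    System.out.println(candidate.startsWith(existing));        // true
    // ...while the same check against the slash-terminated value is correct.
    System.out.println(candidate.startsWith(existing + "/"));  // false
  }
}
```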
String query = generateVersionQuery();
try {
  List<SchemaVersion> schemaVersion =
      datasourceOperations.executeSelect(query, new SchemaVersion());
Using "capability" flags instead of (simple) versions would be preferable from my POV.
In this case it looks like we need to check specifically for the presence of the new index; other schema changes (if they existed) would be irrelevant.
Because the schema files are simply versioned, I think it makes sense to record the schema file which was used to create the tables in the metastore itself. Otherwise, you'd have to do a complex mapping from capabilities back to a `schema-v?` file. That said, as we progress with adding capabilities, I think it makes sense to write down somewhere which capabilities become available in which version.
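A minimal sketch of what recording that mapping could look like (the enum, names, and per-version contents are hypothetical, not from the PR):

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical mapping from a recorded schema version to the capabilities it provides.
public final class SchemaCapabilities {
  enum Capability {
    LOCATION_COLUMN, // entities table has a dedicated location column
    LOCATION_INDEX   // index covering (realm_id, parent_id, location)
  }

  private static final Map<Integer, Set<Capability>> BY_VERSION =
      Map.of(
          1, EnumSet.noneOf(Capability.class),
          2, EnumSet.of(Capability.LOCATION_COLUMN, Capability.LOCATION_INDEX));

  public static boolean supports(int schemaVersion, Capability capability) {
    return BY_VERSION
        .getOrDefault(schemaVersion, EnumSet.noneOf(Capability.class))
        .contains(capability);
  }
}
```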
String query = QueryGenerator.generateOverlapQuery(realmId, parentId, location);
try {
  var results = datasourceOperations.executeSelect(query, new ModelEntity());
  if (results.isEmpty()) {
What if the version is `2` but the location data is not present in tables?
Then the check will pass; if you have a persistence in a weird state like this, you should not enable `OPTIMIZED_SIBLING_CHECK`. It's disabled by default.
+ "purge_timestamp, sub_type_code, create_timestamp, last_update_timestamp, parent_id, " | ||
+ "name, location, id, drop_timestamp, properties, grant_records_version, type_code " | ||
+ "FROM POLARIS_SCHEMA.ENTITIES WHERE realm_id = 'realmId' AND parent_id = -398224152 " | ||
+ "AND (1 = 2 OR location = 's3:/' OR location = 's3://' OR location = 's3://bucket/' OR " |
I believe the `s3:/` and `s3://` cases are conceptually incorrect.
Can you explain that belief more? I think that if you really have an entity at location `s3:` or `s3:/` then something has gone very wrong. New tables at s3 locations can fail to create.
`s3:/` is not a valid storage URI by any means. It is specifically not a valid prefix for a storage location. What is the rationale for trying to find it in the location properties of other entries?
Granted, Polaris should not allow storing `s3:/` as a location property. However, if that happens for any reason, the current impl. will block any other S3 catalog creation. In effect, the current optimization implementation has a tendency to exacerbate errors (as opposed to dampening them, or at least being neutral).
In effect, the current optimization implementation has a tendency to exacerbate errors
The error is real; the path `example://path` does overlap with `example:/`.
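For context, the `s3:/` and `s3://` terms fall out of splitting the location on `/`: the empty component after the scheme still contributes a cumulative prefix. A small sketch of that expansion (illustrative, mirroring the splitting logic in `generateOverlapQuery`):

```java
public class PrefixExpansionSketch {
  public static void main(String[] args) {
    // "s3://bucket/tmp/location" splits into ["s3:", "", "bucket", "tmp", "location"],
    // so the cumulative prefixes compared with equality are:
    //   s3:/  s3://  s3://bucket/  s3://bucket/tmp/  s3://bucket/tmp/location/
    String location = "s3://bucket/tmp/location";
    StringBuilder path = new StringBuilder();
    for (String component : location.split("/")) {
      path.append(component).append("/");
      System.out.println(path);
    }
  }
}
```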
+ "name, location, id, drop_timestamp, properties, grant_records_version, type_code " | ||
+ "FROM POLARIS_SCHEMA.ENTITIES WHERE realm_id = 'realmId' AND parent_id = -398224152 " | ||
+ "AND (1 = 2 OR location = 's3:/' OR location = 's3://' OR location = 's3://bucket/' OR " | ||
+ "location = 's3://bucket/tmp/' OR location = 's3://bucket/tmp/location/' OR location " |
What about case differences in the URI scheme or bucket name?
What if a particular storage technology treats paths in a case-insensitive way?
What about special chars in the URI?
What about case differences in the URI scheme or bucket name?
Then there is no overlap.
What is a particular storage technology treats path in a case-insensitive way?
The check is case sensitive, to be clear. From what I can see, the current check is also case sensitive (1, 2).
Do you mean what if a storage technology does that? Then the current logic in Polaris won't work.
What about special chars in the URI?
What about them?
Re: special chars, I believe this PR needs specific tests to ensure they are handled correctly.
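One possible shape for such a test (a sketch of the behaviour being asked for; it assumes quotes would be escaped by doubling, which is an assumption rather than something the current `String.format`-based generation guarantees):

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;

class QueryGeneratorSpecialCharsTest {
  @Test
  void testQuotesAndLikeWildcardsInLocation() {
    // Hypothetical expectation: a single quote must not break out of the SQL string
    // literal, and LIKE wildcards (% and _) must not silently widen the match.
    String query =
        QueryGenerator.generateOverlapQuery("realmId", 42L, "s3://bucket/it's_100%/");

    assertThat(query).contains("it''s");      // quote escaped by doubling (assumption)
    assertThat(query).doesNotContain("it's"); // an unescaped quote would corrupt the SQL
  }
}
```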
Do you mean what if a storage technology does that?
If we later support a storage technology with case-insensitive file paths, the existing implementation, which is grounded in a database-level string comparison feature, will have a hard time supporting it (initially it will not work correctly, as far as I can tell).
Conversely, if the admin user for some reason (accidentally?) configures the database to perform case-insensitive comparisons, it will break Polaris Management API contracts (behaviour) for the current set of storage technologies.
All in all, the current approach appears to introduce too many obscure dependencies between API-level behaviours and backend behaviours (IMHO).
I support the idea of leveraging database capabilities to facilitate finding overlapping catalogs. However, I think it should be done in a way where any oddities at the database level do not affect the correctness of user-visible Polaris behaviour. In other words, if something is not aligned at the database level, it would be acceptable for Polaris performance to degrade as long as correctness is not affected.
In that regard, the presence of any entity with a location, but without an indexed location, should probably invalidate the optimized lookup result. If this requires admin user intervention, we have to make it explicit.
Also, the parsing of the URI into path components (prefixes) should probably be generalized outside the Persistence layer (for reuse by all Persistence implementations).
If we later support a storage technology with case-insensitive file paths, the existing implementation ... will have a hard time supporting it
We will have to return `Optional.empty()` for the `hasOverlappingSiblings` method when that storage type is in use. So we'll have an easy time supporting this hypothetical storage technology, just not using the optimized check. However, the 4 storage technologies we currently actually support do work with the optimized check.
Conversely, if the admin user for some reason (accidentally?) configures the database to perform case-insensitive comparisons
If this is a real configuration, then tons of things in our existing persistence implementation will break very severely, such as lookup by name. We rely on the database being case-sensitive already, so this PR isn't introducing any new dependency there.
Also, the parsing of the URI into path components (prefixes) should probably be generalized outside the Persistence layer
Only one persistence implementation currently supports this check, so encapsulating the logic there seems wise.
In that regard, the presence of any entity with a location, but without an indexed location, should probably invalidate the optimized lookup result.
This is not a state you can get into with the current implementation.
 * Check if the specified IcebergTableLikeEntity has any same-namespace siblings which share a
 * location
 *
 * @param callContext the polaris call context
 * @param parentId the parent entity to look for duplicates inside
 * @param location the location to check for overlaps against
 * @return Optional.of(Optional.of(location)) if the parent entity has children,
 *     Optional.of(Optional.empty()) if not, and Optional.empty() if the metastore doesn't support
 *     this operation
 */
Optional<Optional<String>> hasOverlappingSiblings(
    @Nonnull PolarisCallContext callContext, long parentId, String location);
This method implies that all Persistence implementations have a common understanding of what `location` means in entities. However, ATM, the location is just an opaque entity property (part of a map of generic properties).
I think we need to, at a minimum, add javadoc about how this location is defined. Ideally, since it is used as an interface parameter, it should be exposed as a first-class accessor method in entities.
This method implies that all Persistence implementations have a common understanding of what location means in entities.
I don't think that's exactly true. There's nothing here that implies commonality across persistence implementations. One persistence implementation could have `location` as the base location, while another persistence implementation could have the base location flipped backwards. Both implementations could work.
Ideally, since it is used as an interface parameter, it should be exposed as a first-class accessor method in entities.
The entities that have a location do expose a method to get the location -- `getBaseLocation`. There's no interface for this method. I would love to see this refactored, and then this method could take an `EntityWithLocation` or whatever interface we choose to have the location-having entities implement. But since we don't use those types consistently now, I think that change would be quite difficult to make. It also has implications on the persistence implementations themselves and on how we treat the various entity objects, and previous discussions about that (a refactor oriented around DAOs) were highly contentious.
Sorry, I missed out on `getDefaultBaseLocation()` in `CatalogEntity`.
However, I believe the new API method should be clear about the relationship between `CatalogEntity.getDefaultBaseLocation()` and the `location` method parameter.
Also, re: `getDefaultBaseLocation()`... since it's "default", is there a possibility it's overridden? Is there a possibility a catalog has other, non-default locations that need to be checked for overlaps?
On your last note, the intent here is to optimize the current logic. The current logic looks at `getBaseLocation`.
I've introduced a new type to capture all the entities that have a location and refactored everything to use that type (mixed into `PolarisEntity`). Please let me know what you think.
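A minimal sketch of the shape such a type could take, using the `EntityWithLocation` name floated earlier in the thread (the interface body and wiring are illustrative; see the PR for the actual refactor):

```java
// Hypothetical interface for entities that carry a base location.
public interface EntityWithLocation {
  /** Base storage location of this entity, e.g. "s3://bucket/ns/table/". */
  String getBaseLocation();
}

// Location-bearing entities (tables, views, namespaces, catalogs) would implement it,
// and the overlap check could then accept the entity rather than a raw string:
//   Optional<Optional<String>> hasOverlappingSiblings(
//       PolarisCallContext callContext, long parentId, EntityWithLocation entity);
```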
+ " supported by the JDBC metastore.") | ||
"When set, an index is used to perform the sibling check between tables, views, and namespaces. New " + | ||
"locations will be checked against previous ones based on components, so the new location " + | ||
"/foo/bar/ will check for a sibling at /, /foo/ and /foo/bar/%. In order for this check to " + |
nit: It might be preferable to avoid using plain file locations in the example since the FILE storage type is disabled by default.
String name = entity.getName();

// Attempt to directly query for siblings
boolean useOptimizedSiblingCheck =
Isn't this something that the persistence implementation should decide?
The location overlap check for "sibling" tables (those which share a parent) has been a performance bottleneck since its introduction, but we haven't historically had a good way around this other than just disabling the check.
Current Behavior
The current logic is that when we create a table, we list all sibling tables and check each and every one to ensure there is no location overlap. This results in O(N^2) checks when adding N tables to a namespace, quickly becoming untenable.
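Roughly, the existing behaviour amounts to something like the following (a simplified sketch, not the actual IcebergCatalog code): every creation loads all siblings and compares their locations, so creating N tables performs on the order of N^2 comparisons in total.

```java
import java.util.List;

// Simplified sketch of the existing per-sibling check (illustrative only).
class SiblingOverlapSketch {
  // Creating the k-th table compares against the k-1 existing siblings, so creating
  // N tables costs O(N^2) comparisons overall, plus loading every sibling entity.
  static boolean overlapsAnySibling(List<String> siblingLocations, String newLocation) {
    for (String existing : siblingLocations) {
      if (newLocation.startsWith(existing) || existing.startsWith(newLocation)) {
        return true; // one location is nested inside the other
      }
    }
    return false;
  }
}
```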
With the `CreateTreeDataset` benchmark I tested creating 5000 sibling tables using the current code. It is apparent that latency increases over time: runs took between 90 and 200+ seconds, and Polaris instances with a small memory allocation were prone to crashing due to OOMs.
Proposed change
This PR adds a new persistence API, `hasOverlappingSiblings`, which if implemented can be used to directly check for the presence of siblings at the metastore layer. This API is implemented for the JDBC metastore in a new schema version, and some changes are made to account for an evolving schema version now and in the future.
This implementation breaks a location down into components and queries for a sibling at each of those locations, so a new table at location `s3://bucket/root/n1/nA/t1/` will require checking for an entity with location `s3://bucket/`, `s3://bucket/root/`, `s3://bucket/root/n1/`, `s3://bucket/root/n1/nA/`, and finally `s3://bucket/root/n1/nA/t1/%`. All of this can be done in a single query which makes a single pass over the data. The query is optimized by the introduction of a new index over a new location column.
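A sketch of how a caller might consume the new API (illustrative; `metaStoreManager`, `callContext`, and the legacy fallback are placeholders, and the actual wiring in the PR may differ). The nested `Optional` distinguishes "check not supported" from "no overlap found":

```java
import java.util.Optional;

// Hypothetical caller-side handling of the nested Optional returned by hasOverlappingSiblings.
void ensureNoOverlappingSibling(long parentId, String location) {
  Optional<Optional<String>> result =
      metaStoreManager.hasOverlappingSiblings(callContext, parentId, location);

  if (result.isEmpty()) {
    // The metastore cannot answer directly; fall back to listing and checking siblings.
    runLegacySiblingCheck(parentId, location);
  } else if (result.get().isPresent()) {
    // An overlapping sibling exists; the inner value carries its location for the error.
    throw new IllegalStateException(
        "Location " + location + " overlaps with existing entity at " + result.get().get());
  }
  // Otherwise no sibling overlaps and creation can proceed.
}
```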
With the changes enabled, I tested creating 5000 sibling tables. Latency is stable over time, and runs consistently completed in less than 30 seconds. I did not observe any OOMs when testing with the feature enabled.