Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TableLocationSupplier for more flexible table placement #2855

Closed
wants to merge 3 commits into from

Conversation

danielcweeks
Copy link
Contributor

@danielcweeks danielcweeks commented Jul 23, 2021

Currently table location placement is somewhat restrictive because the base location is defined by the catalog, but doesn't have all relevant information to construct paths for certain scenarios. The only information available to the catalog is TableIdentifier.

Adding a TableLocationSupplier interface allows for more information to be used in defining the location of the table and allow for more flexible layouts and take advantage of other fields that are only populated later in the table creation lifecycle (like uuid, partition spec, sort order, or table properties).

Due to the way the table/metadata creation is performed, this is a little difficult to integrate both cleanly and in a backwards compatible way.

Posting this as a possible approach to allow for more flexible table layouts.

@@ -154,6 +155,25 @@ public String toString() {

protected abstract String defaultWarehouseLocation(TableIdentifier tableIdentifier);

protected TableLocationSupplier tableLocationSupplier(TableIdentifier tableIdentifier) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to make this a supplier rather than passing the incoming options to a method like defaultWarehouseLocation? Is that because the table UUID is created in the TableMetadata constructor?

Are there cases where you think the location would be based on the schema or sort order? I can't think of valid cases and I wonder if this could actually be a bad idea. For example, maybe this could look for a column pii struct<email string, ...> and choose to place the table in a "sensitive" bucket, but I don't think that would be a good pattern for people to follow because it is error-prone: using PII for the column name could break it.

I'm also skeptical that location would be based on the partition spec, especially because the spec may change.

@rdblue
Copy link
Contributor

rdblue commented Sep 12, 2021

It seems like this is required because the table UUID is generated by TableMetadata. I think that I would rather simplify this and pass the UUID into TableMetadata when it is created instead and customize the location entirely within the catalog. So an alternative to this is adding the TableMetadata builder, #2957.

@jackye1995
Copy link
Contributor

My understanding is that you are trying to make the default table location a function of all the other table states, and make this function a potentially pluggable interface.

The biggest use case I see is for tables renamed to have the same name as an old table, they will still have a different path and not clobber the data if UUID is a part of the table path. Please let me know if there is any additional important use case that I missed.

So to achieve this feature, I think another way is to have a different method in parallel to table builder's withLocation(String newLoaction) method called withTableRootLocationProvider(String providerImpl) (or this can be from table properties) to supply a root location provider. This provider is basically Function<TableMetadataBuilder, String>, and we will fill the null location value based on location provider in the table metadata builder.

In the PR that Ryan referenced, I have changed to implement the TableMetadataUpdateBuilder that works specifically for the update metadata use case. Given this case, I think it now makes more sense to have a separated TableMetadataBuilder to create new table metadata, and we can initialize the location provider and resolve the location in the builder.

To make this backwards compatible with existing implementations, we can remove the defaultWarehouseLocation method, and instead provide 2 implementations of the provider, which are HiveTableRootLocationProvider and HadoopTableRootLocationProvider, that should be able to cover all the existing use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants