-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Register existing tables in Iceberg HiveCatalog #3851
Conversation
} | ||
|
||
@Test | ||
public void testCloneTable() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this test just because I don't want to give anyone ideas :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed :) Removed this test.
1e86f9e
to
72f3128
Compare
Can I use this to register an arbitrary table from any catalog (HadoopTable or HadoopCatalog or GlueCatalog) to the HiveCatalog? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic looks good to me, had some comments
Preconditions.checkArgument(isValidIdentifier(identifier), "Invalid identifier: %s", identifier); | ||
|
||
TableOperations ops = newTableOps(identifier); | ||
HadoopInputFile metadataFile = HadoopInputFile.fromLocation(metadataFileLocation, conf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we use FileIO to get the InputFile instead of hardcoding HadoopFileIO (in case we are using S3FileIO, for example?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Szehon. fileIO
should be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. Updated to use FileIO
.
@@ -340,6 +340,17 @@ default boolean dropTable(TableIdentifier identifier) { | |||
*/ | |||
Table loadTable(TableIdentifier identifier); | |||
|
|||
/** | |||
* Register a table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems from the test, the table is not there and it will create one from the file. It wasn't too obvious from the name of the API, can we enhance the javadoc a bit to detail that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the table is already there? Do we throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the java doc to mention this API will register a table with the catalog if it does not exist. It throws an exception if it exists. I have also added a unit test case.
String newMetadataLocation = base == null && metadata.metadataFileLocation() != null ? | ||
metadata.metadataFileLocation() : writeNewMetadata(metadata, currentVersion() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this change is not related. Can we remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it used by the new test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change, registering a table creates a new metadata file with a new version instead of using the version provided by the metadata file. Yes, the tests also rely on this.
72f3128
to
cedc545
Compare
@pvary - Yes, all this needs is the latest metadata file. |
This was my conclusion too (after I have left the comment I continued chewing on this one 😄) Then I went further, and this is where I am now:
Is there a way to prevent this situation? Thanks, Peter |
@@ -340,6 +340,18 @@ default boolean dropTable(TableIdentifier identifier) { | |||
*/ | |||
Table loadTable(TableIdentifier identifier); | |||
|
|||
/** | |||
* Register a table with the catalog if it does not exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @bryanck. What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, it is very straightforward, I just had one question below...
@@ -211,6 +214,23 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { | |||
} | |||
} | |||
|
|||
@Override | |||
public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we put this in BaseMetastoreCatalog? It doesn't look like there is anything Hive-specific here, so other catalog implementations could potentially benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. Potentially, other catalogs can benefit from this. However, FileIO
is initialized with the catalog and there maybe custom implementations passed as catalog properties. I'm not sure yet on how to move this logic to BaseMetastoreCatalog
. Do you mind if I do that as a seperate enhancement PR to this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks
Thanks for the questions @pvary. I haven't really thought about concurrent access to the tables from multiple catalogs. It did come up a little bit in our DR discussion in the dev list. I'm not sure this should be allowed. @aokolnychyi - What are your thoughts on this? |
cedc545
to
1f1c6df
Compare
This looks good to me when tests are passing. |
Looks like the build failed. Is there a way to re-trigger the build without pushing a new change? |
Looks like this build errors happened in multiple PRs. I saw it #3745 as well. It could be an issue in build infra or some changes in Gradle. @rdblue
|
Seems like jcenter is having issues... https://status.gradle.com |
The Jcenter is back online now. The build should pass once it is retriggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea why the Javadoc build is breaking, I re ran it several times through the workflow. But the links it can't get the resource from also don't work for me.
Tests Passed and Merged! Thanks @anuragmantri and all reviewers! |
Thanks to @aokolnychyi, the original author of the PR and everyone for the review! |
This PR allows us to register existing tables in Iceberg HiveCatalog.
Right now, we can keep metadata and data files while dropping tables from Iceberg HiveCatalog. However, there is no way to register those tables back even though the metadata and data files can be still there. So, we need to extend HiveCatalog with another method that will accept a location to the valid metadata file and would register a table in HMS.
This will allow us to properly support external tables in Spark.