
API: Register existing tables in Iceberg HiveCatalog #3851

Merged: 1 commit into apache:master on Jan 13, 2022

Conversation

anuragmantri (Contributor)

This PR allows us to register existing tables in Iceberg HiveCatalog.

Right now, we can keep metadata and data files while dropping tables from the Iceberg HiveCatalog. However, there is no way to register those tables back even though the metadata and data files may still be there. So we need to extend HiveCatalog with another method that accepts the location of a valid metadata file and registers the table in HMS.

This will allow us to properly support external tables in Spark.

}

@Test
public void testCloneTable() throws IOException {
Member

I would remove this test just because I don't want to give anyone ideas :)

Contributor Author

Agreed :) Removed this test.

@pvary (Contributor) commented Jan 6, 2022

Can I use this to register an arbitrary table from any catalog (HadoopTable or HadoopCatalog or GlueCatalog) to the HiveCatalog?

@szehon-ho (Collaborator) left a comment

Logic looks good to me, had some comments

Preconditions.checkArgument(isValidIdentifier(identifier), "Invalid identifier: %s", identifier);

TableOperations ops = newTableOps(identifier);
HadoopInputFile metadataFile = HadoopInputFile.fromLocation(metadataFileLocation, conf);
Collaborator

Shouldn't we use FileIO to get the InputFile instead of hardcoding HadoopFileIO (in case we are using S3FileIO, for example)?

Contributor

Agree with Szehon. FileIO should be used here.

Contributor Author

Thanks for catching this. Updated to use FileIO.
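The shape of that fix can be sketched as follows. The interfaces below are simplified stand-ins for Iceberg's FileIO/InputFile abstractions, not the real classes: the point is that the metadata file is resolved through whatever IO the catalog was configured with, rather than through a hardcoded Hadoop class.

```java
// Simplified stand-ins for Iceberg's io abstractions (illustrative only).
interface InputFile {
    String location();
}

interface FileIO {
    InputFile newInputFile(String location);
}

public class RegisterTableIoSketch {
    // A FileIO backed by whatever storage the catalog was configured with
    // (HadoopFileIO, S3FileIO, a custom implementation, ...).
    static FileIO catalogIo = location -> () -> location;

    static InputFile resolveMetadataFile(FileIO io, String metadataFileLocation) {
        // Before: HadoopInputFile.fromLocation(metadataFileLocation, conf)
        // After:  delegate to the catalog's FileIO so S3 and custom stores work.
        return io.newInputFile(metadataFileLocation);
    }

    public static void main(String[] args) {
        InputFile file = resolveMetadataFile(catalogIo, "s3://bucket/metadata/v3.json");
        if (!file.location().equals("s3://bucket/metadata/v3.json")) {
            throw new AssertionError("unexpected location: " + file.location());
        }
        System.out.println("resolved: " + file.location());
    }
}
```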

@@ -340,6 +340,17 @@ default boolean dropTable(TableIdentifier identifier) {
*/
Table loadTable(TableIdentifier identifier);

/**
* Register a table.
Collaborator

From the test, it seems the table is not there and this will create one from the file. That wasn't obvious from the name of the API; can we enhance the Javadoc a bit to detail that?

Contributor

What happens if the table is already there? Do we throw an exception?

Contributor Author

Updated the Javadoc to mention that this API registers a table with the catalog if it does not exist, and throws an exception if it does. I have also added a unit test case.
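The contract described above can be modeled with a hypothetical in-memory catalog; all names here are illustrative stand-ins, not Iceberg's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the registerTable contract: register the table if the
// identifier is free, throw if an entry already exists.
public class RegisterTableContract {
    private final Map<String, String> metadataByTable = new HashMap<>();

    public String registerTable(String identifier, String metadataFileLocation) {
        if (metadataByTable.containsKey(identifier)) {
            // Models rejecting registration over an occupied identifier.
            throw new IllegalStateException("Table already exists: " + identifier);
        }
        metadataByTable.put(identifier, metadataFileLocation);
        return metadataByTable.get(identifier);
    }

    public static void main(String[] args) {
        RegisterTableContract catalog = new RegisterTableContract();
        String loc = catalog.registerTable("db.tbl", "/warehouse/db/tbl/metadata/v2.json");
        System.out.println("registered at " + loc);
        try {
            catalog.registerTable("db.tbl", "/warehouse/db/tbl/metadata/v3.json");
            throw new AssertionError("expected failure on duplicate registration");
        } catch (IllegalStateException expected) {
            System.out.println("duplicate rejected: " + expected.getMessage());
        }
    }
}
```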

Comment on lines +213 to +214
String newMetadataLocation = base == null && metadata.metadataFileLocation() != null ?
metadata.metadataFileLocation() : writeNewMetadata(metadata, currentVersion() + 1);
Contributor

Looks like this change is not related. Can we remove it?

Contributor

Is it used by the new test?

Contributor Author

Without this change, registering a table creates a new metadata file with a new version instead of using the version provided by the metadata file. Yes, the tests also rely on this.
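The condition on lines +213 to +214 can be read as a standalone decision, sketched here with illustrative method and parameter names: on the first commit of a registration (no base metadata), reuse the location already recorded in the metadata file instead of writing a new version.

```java
public class MetadataLocationChoice {
    static String chooseMetadataLocation(String baseLocation,
                                         String recordedLocation,
                                         String freshlyWrittenLocation) {
        // base == null && metadata.metadataFileLocation() != null
        //   -> registering an existing table: keep the provided metadata file.
        // otherwise
        //   -> a normal commit: write a new metadata version.
        return baseLocation == null && recordedLocation != null
            ? recordedLocation
            : freshlyWrittenLocation;
    }

    public static void main(String[] args) {
        // Registration path: no base, metadata file already has a location.
        String registered = chooseMetadataLocation(
            null, "/warehouse/t/metadata/v5.json", "/warehouse/t/metadata/v6.json");
        if (!registered.equals("/warehouse/t/metadata/v5.json")) {
            throw new AssertionError(registered);
        }
        // Normal commit path: a base version exists, so a new file is written.
        String committed = chooseMetadataLocation(
            "/warehouse/t/metadata/v5.json", "/warehouse/t/metadata/v5.json",
            "/warehouse/t/metadata/v6.json");
        if (!committed.equals("/warehouse/t/metadata/v6.json")) {
            throw new AssertionError(committed);
        }
        System.out.println("ok");
    }
}
```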

@anuragmantri (Contributor Author)

Can I use this to register an arbitrary table from any catalog (HadoopTable or HadoopCatalog or GlueCatalog) to the HiveCatalog?

@pvary - Yes, all this needs is the latest metadata file.

@pvary (Contributor) commented Jan 8, 2022

@pvary - Yes, all this needs is the latest metadata file.

This was my conclusion too (after I left the comment, I kept chewing on this one 😄). Then I went further, and this is where I am now:

  • What happens if a table is modified in one of the catalogs? (I suspect this is catalog dependent: HiveCatalog will not follow the changes, but HadoopTable/HadoopCatalog might follow each other's changes, as they rely on the file names.)
  • Issues will also arise if we start expiring snapshots. If one catalog decides that one of the files is no longer needed, it could remove that file even if it is still needed by the other table.

Is there a way to prevent this situation?

Thanks, Peter

@@ -340,6 +340,18 @@ default boolean dropTable(TableIdentifier identifier) {
*/
Table loadTable(TableIdentifier identifier);

/**
* Register a table with the catalog if it does not exist.
Contributor

FYI @bryanck. What do you think about this?

Contributor

LGTM, it is very straightforward, I just had one question below...

@@ -211,6 +214,23 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) {
}
}

@Override
public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
Contributor

Could we put this in BaseMetastoreCatalog? It doesn't look like there is anything Hive-specific here, so other catalog implementations could potentially benefit.

Contributor Author

This is a good point. Potentially, other catalogs can benefit from this. However, FileIO is initialized with the catalog, and there may be custom implementations passed as catalog properties. I'm not sure yet how to move this logic to BaseMetastoreCatalog. Do you mind if I do that as a separate enhancement PR to this one?

Contributor

Sure, thanks
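The follow-up idea could look roughly like a template method: shared registration logic in an abstract base, with each concrete catalog supplying its own IO. Everything below is an illustrative stand-in, not Iceberg's actual BaseMetastoreCatalog.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class holding the catalog-agnostic registration logic.
abstract class AbstractCatalogSketch {
    protected final Map<String, String> tables = new HashMap<>();

    // Concrete catalogs provide their configured IO (Hadoop, S3, custom, ...);
    // modeled here as reading the content behind a metadata location.
    protected abstract String readMetadata(String metadataFileLocation);

    // Shared registration logic lives in the base class.
    public String registerTable(String identifier, String metadataFileLocation) {
        String metadata = readMetadata(metadataFileLocation); // catalog-specific IO
        tables.put(identifier, metadataFileLocation);
        return metadata;
    }
}

public class HiveLikeCatalogSketch extends AbstractCatalogSketch {
    @Override
    protected String readMetadata(String location) {
        return "metadata@" + location; // stand-in for loading/parsing the file
    }

    public static void main(String[] args) {
        HiveLikeCatalogSketch catalog = new HiveLikeCatalogSketch();
        System.out.println(catalog.registerTable("db.t", "/wh/db/t/metadata/v1.json"));
    }
}
```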

@anuragmantri (Contributor Author)

  • What happens if a table is modified in one of the catalogs? (I suspect this is catalog dependent: HiveCatalog will not follow the changes, but HadoopTable/HadoopCatalog might follow each other's changes, as they rely on the file names.)
  • Issues will also arise if we start expiring snapshots. If one catalog decides that one of the files is no longer needed, it could remove that file even if it is still needed by the other table.

Thanks for the questions @pvary. I haven't really thought about concurrent access to the tables from multiple catalogs. It did come up a bit in our DR discussion on the dev list. I'm not sure this should be allowed.

@aokolnychyi - What are your thoughts on this?

@rdblue (Contributor) commented Jan 12, 2022

This looks good to me when tests are passing.

@anuragmantri (Contributor Author) commented Jan 12, 2022

Looks like the build failed. Is there a way to re-trigger the build without pushing a new change?

@flyrain (Contributor) commented Jan 12, 2022

Looks like these build errors happened in multiple PRs; I saw them in #3745 as well. It could be an issue in the build infra or some change in Gradle. @rdblue

A problem occurred configuring root project 'iceberg'.
> Could not resolve all files for configuration ':classpath'.
   > Could not resolve me.champeau.jmh:jmh-gradle-plugin:0.6.6.
     Required by:
         project :
      > Could not resolve me.champeau.jmh:jmh-gradle-plugin:0.6.6.
         > Could not get resource 'https://plugins.gradle.org/m2/me/champeau/jmh/jmh-gradle-plugin/0.6.6/jmh-gradle-plugin-0.6.6.module'.
            > Could not GET 'https://jcenter.bintray.com/me/champeau/jmh/jmh-gradle-plugin/0.6.6/jmh-gradle-plugin-0.6.6.module'. Received status code 502 from server: Bad Gateway
   > Could not resolve org.apache.logging.log4j:log4j-core:2.17.0.

@bryanck (Contributor) commented Jan 12, 2022

Seems like jcenter is having issues... https://status.gradle.com

@flyrain (Contributor) commented Jan 13, 2022

Seems like jcenter is having issues... https://status.gradle.com

JCenter is back online now. The build should pass once it is retriggered.

@RussellSpitzer (Member) left a comment

No idea why the Javadoc build is breaking; I re-ran it several times through the workflow. But the links it can't get the resource from also don't work for me.

@RussellSpitzer merged commit 643a8ac into apache:master on Jan 13, 2022
@RussellSpitzer (Member)

Tests Passed and Merged! Thanks @anuragmantri and all reviewers!

@anuragmantri (Contributor Author)

Thanks to @aokolnychyi, the original author of this PR, and to everyone for the review!
