
API: Register existing tables in Iceberg HiveCatalog #3851

Merged: 1 commit into apache:master on Jan 13, 2022

Conversation

anuragmantri (Contributor)

This PR allows us to register existing tables in Iceberg HiveCatalog.

Right now, we can keep metadata and data files while dropping tables from the Iceberg HiveCatalog. However, there is no way to register those tables back even though the metadata and data files may still be there. So we need to extend HiveCatalog with another method that accepts the location of a valid metadata file and registers the table in HMS.

This will allow us to properly support external tables in Spark.

}

@Test
public void testCloneTable() throws IOException {
Member

I would remove this test just because I don't want to give anyone ideas :)

Contributor Author

Agreed :) Removed this test.

@pvary (Contributor) commented Jan 6, 2022

Can I use this to register an arbitrary table from any catalog (HadoopTable or HadoopCatalog or GlueCatalog) to the HiveCatalog?

@szehon-ho (Collaborator) left a comment

Logic looks good to me, had some comments

Preconditions.checkArgument(isValidIdentifier(identifier), "Invalid identifier: %s", identifier);

TableOperations ops = newTableOps(identifier);
HadoopInputFile metadataFile = HadoopInputFile.fromLocation(metadataFileLocation, conf);
Collaborator

Shouldn't we use FileIO to get the InputFile instead of hardcoding HadoopFileIO (in case we are using S3FileIO, for example)?

Contributor

Agree with Szehon. FileIO should be used here.

Contributor Author

Thanks for catching this. Updated to use FileIO.
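The shape of that fix can be sketched as follows. The interfaces below are simplified stand-ins for Iceberg's FileIO/InputFile abstractions, not the real classes: the point is that the metadata file is resolved through whatever IO the catalog was configured with, rather than through a hardcoded Hadoop class.

```java
// Simplified stand-ins for Iceberg's io abstractions (illustrative only).
interface InputFile {
    String location();
}

interface FileIO {
    InputFile newInputFile(String location);
}

public class RegisterTableIoSketch {
    // A FileIO backed by whatever storage the catalog was configured with
    // (HadoopFileIO, S3FileIO, a custom implementation, ...).
    static FileIO catalogIo = location -> () -> location;

    static InputFile resolveMetadataFile(FileIO io, String metadataFileLocation) {
        // Before: HadoopInputFile.fromLocation(metadataFileLocation, conf)
        // After:  delegate to the catalog's FileIO so S3 and custom stores work.
        return io.newInputFile(metadataFileLocation);
    }

    public static void main(String[] args) {
        InputFile file = resolveMetadataFile(catalogIo, "s3://bucket/metadata/v3.json");
        if (!file.location().equals("s3://bucket/metadata/v3.json")) {
            throw new AssertionError("unexpected location: " + file.location());
        }
        System.out.println("resolved: " + file.location());
    }
}
```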

@@ -340,6 +340,17 @@ default boolean dropTable(TableIdentifier identifier) {
*/
Table loadTable(TableIdentifier identifier);

/**
* Register a table.
Collaborator

From the test, it seems the table is not there and this will create one from the file. That wasn't obvious from the name of the API; can we enhance the Javadoc a bit to detail that?

Contributor

What happens if the table is already there? Do we throw an exception?

Contributor Author

Updated the Javadoc to mention that this API registers a table with the catalog if it does not exist, and throws an exception if it does. I have also added a unit test case.
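The contract described above can be modeled with a hypothetical in-memory catalog; all names here are illustrative stand-ins, not Iceberg's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the registerTable contract: register the table if the
// identifier is free, throw if an entry already exists.
public class RegisterTableContract {
    private final Map<String, String> metadataByTable = new HashMap<>();

    public String registerTable(String identifier, String metadataFileLocation) {
        if (metadataByTable.containsKey(identifier)) {
            // Models rejecting registration over an occupied identifier.
            throw new IllegalStateException("Table already exists: " + identifier);
        }
        metadataByTable.put(identifier, metadataFileLocation);
        return metadataByTable.get(identifier);
    }

    public static void main(String[] args) {
        RegisterTableContract catalog = new RegisterTableContract();
        String loc = catalog.registerTable("db.tbl", "/warehouse/db/tbl/metadata/v2.json");
        System.out.println("registered at " + loc);
        try {
            catalog.registerTable("db.tbl", "/warehouse/db/tbl/metadata/v3.json");
            throw new AssertionError("expected failure on duplicate registration");
        } catch (IllegalStateException expected) {
            System.out.println("duplicate rejected: " + expected.getMessage());
        }
    }
}
```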

Comment on lines +213 to +214
String newMetadataLocation = base == null && metadata.metadataFileLocation() != null ?
metadata.metadataFileLocation() : writeNewMetadata(metadata, currentVersion() + 1);
Contributor

Looks like this change is not related. Can we remove it?

Contributor

Is it used by the new test?

Contributor Author

Without this change, registering a table creates a new metadata file with a new version instead of using the version provided by the metadata file. Yes, the tests also rely on this.
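The condition on lines +213 to +214 can be read as a standalone decision, sketched here with illustrative method and parameter names: on the first commit of a registration (no base metadata), reuse the location already recorded in the metadata file instead of writing a new version.

```java
public class MetadataLocationChoice {
    static String chooseMetadataLocation(String baseLocation,
                                         String recordedLocation,
                                         String freshlyWrittenLocation) {
        // base == null && metadata.metadataFileLocation() != null
        //   -> registering an existing table: keep the provided metadata file.
        // otherwise
        //   -> a normal commit: write a new metadata version.
        return baseLocation == null && recordedLocation != null
            ? recordedLocation
            : freshlyWrittenLocation;
    }

    public static void main(String[] args) {
        // Registration path: no base, metadata file already has a location.
        String registered = chooseMetadataLocation(
            null, "/warehouse/t/metadata/v5.json", "/warehouse/t/metadata/v6.json");
        if (!registered.equals("/warehouse/t/metadata/v5.json")) {
            throw new AssertionError(registered);
        }
        // Normal commit path: a base version exists, so a new file is written.
        String committed = chooseMetadataLocation(
            "/warehouse/t/metadata/v5.json", "/warehouse/t/metadata/v5.json",
            "/warehouse/t/metadata/v6.json");
        if (!committed.equals("/warehouse/t/metadata/v6.json")) {
            throw new AssertionError(committed);
        }
        System.out.println("ok");
    }
}
```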

@anuragmantri (Contributor Author)

Can I use this to register an arbitrary table from any catalog (HadoopTable or HadoopCatalog or GlueCatalog) to the HiveCatalog?

@pvary - Yes, all this needs is the latest metadata file.

@pvary (Contributor) commented Jan 8, 2022

@pvary - Yes, all this needs is the latest metadata file.

This was my conclusion too (after I left the comment, I kept chewing on this one 😄). Then I went further, and this is where I am now:

  • What happens if a table is modified in one of the catalogs? (I suspect this is catalog dependent: HiveCatalog will not follow the changes, but HadoopTable/HadoopCatalog might follow each other's changes, as they rely on the file names.)
  • Issues will also arise if we start expiring snapshots. If one catalog decides that one of the files is no longer needed, it could remove that file even if it is still needed by the other table.

Is there a way to prevent this situation?

Thanks, Peter

@@ -340,6 +340,18 @@ default boolean dropTable(TableIdentifier identifier) {
*/
Table loadTable(TableIdentifier identifier);

/**
* Register a table with the catalog if it does not exist.
Contributor

FYI @bryanck. What do you think about this?

Contributor

LGTM, it is very straightforward, I just had one question below...

@@ -211,6 +214,23 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) {
}
}

@Override
public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
Contributor

Could we put this in BaseMetastoreCatalog? It doesn't look like there is anything Hive-specific here, so other catalog implementations could potentially benefit.

Contributor Author

This is a good point. Potentially, other catalogs can benefit from this. However, FileIO is initialized with the catalog, and there may be custom implementations passed as catalog properties. I'm not sure yet how to move this logic to BaseMetastoreCatalog. Do you mind if I do that as a separate enhancement PR to this one?

Contributor

Sure, thanks
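The follow-up idea could look roughly like a template method: shared registration logic in an abstract base, with each concrete catalog supplying its own IO. Everything below is an illustrative stand-in, not Iceberg's actual BaseMetastoreCatalog.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class holding the catalog-agnostic registration logic.
abstract class AbstractCatalogSketch {
    protected final Map<String, String> tables = new HashMap<>();

    // Concrete catalogs provide their configured IO (Hadoop, S3, custom, ...);
    // modeled here as reading the content behind a metadata location.
    protected abstract String readMetadata(String metadataFileLocation);

    // Shared registration logic lives in the base class.
    public String registerTable(String identifier, String metadataFileLocation) {
        String metadata = readMetadata(metadataFileLocation); // catalog-specific IO
        tables.put(identifier, metadataFileLocation);
        return metadata;
    }
}

public class HiveLikeCatalogSketch extends AbstractCatalogSketch {
    @Override
    protected String readMetadata(String location) {
        return "metadata@" + location; // stand-in for loading/parsing the file
    }

    public static void main(String[] args) {
        HiveLikeCatalogSketch catalog = new HiveLikeCatalogSketch();
        System.out.println(catalog.registerTable("db.t", "/wh/db/t/metadata/v1.json"));
    }
}
```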

@anuragmantri (Contributor Author)

  • What happens if a table is modified in one of the catalogs? (I suspect this is catalog dependent: HiveCatalog will not follow the changes, but HadoopTable/HadoopCatalog might follow each other's changes, as they rely on the file names.)
  • Issues will also arise if we start expiring snapshots. If one catalog decides that one of the files is no longer needed, it could remove that file even if it is still needed by the other table.

Thanks for the questions @pvary. I haven't really thought about concurrent access to the tables from multiple catalogs. It did come up a bit in our DR discussion on the dev list. I'm not sure this should be allowed.

@aokolnychyi - What are your thoughts on this?

@rdblue (Contributor) commented Jan 12, 2022

This looks good to me when tests are passing.

@anuragmantri (Contributor Author) commented Jan 12, 2022

Looks like the build failed. Is there a way to re-trigger the build without pushing a new change?

@flyrain (Contributor) commented Jan 12, 2022

Looks like these build errors happened in multiple PRs; I saw them in #3745 as well. It could be an issue in the build infra or some change in Gradle. @rdblue

A problem occurred configuring root project 'iceberg'.
> Could not resolve all files for configuration ':classpath'.
   > Could not resolve me.champeau.jmh:jmh-gradle-plugin:0.6.6.
     Required by:
         project :
      > Could not resolve me.champeau.jmh:jmh-gradle-plugin:0.6.6.
         > Could not get resource 'https://plugins.gradle.org/m2/me/champeau/jmh/jmh-gradle-plugin/0.6.6/jmh-gradle-plugin-0.6.6.module'.
            > Could not GET 'https://jcenter.bintray.com/me/champeau/jmh/jmh-gradle-plugin/0.6.6/jmh-gradle-plugin-0.6.6.module'. Received status code 502 from server: Bad Gateway
   > Could not resolve org.apache.logging.log4j:log4j-core:2.17.0.

@bryanck (Contributor) commented Jan 12, 2022

Seems like jcenter is having issues... https://status.gradle.com

@flyrain (Contributor) commented Jan 13, 2022

Seems like jcenter is having issues... https://status.gradle.com

JCenter is back online now. The build should pass once it is retriggered.

@RussellSpitzer (Member) left a comment

No idea why the Javadoc build is breaking; I re-ran it several times through the workflow. But the links it can't get the resource from also don't work for me.

@RussellSpitzer merged commit 643a8ac into apache:master on Jan 13, 2022
@RussellSpitzer (Member)

Tests Passed and Merged! Thanks @anuragmantri and all reviewers!

@anuragmantri (Contributor Author)

Thanks to @aokolnychyi, the original author of this PR, and to everyone for the review!
