
Core: Make Metadata tables serializable #2046

Merged: 2 commits from the serialize branch into apache:master on Jan 22, 2021

Conversation

pvary
Contributor

@pvary pvary commented Jan 7, 2021

When used from Hive queries, it would be useful to serialize the tables at query compilation time, for multiple reasons:

  • If we use the same snapshot during the query execution, we get consistent results
  • If we do not have to access the catalog during the query execution, we can save HMS calls

Serialization was implemented for BaseTable in #1920. This PR aims to do the same for the metadata tables.

Things that might be worth checking:

  • Moved the SerializationUtil class to the core package. It is currently needed only for the tests, but I thought this would be OK.
  • Every writeReplace() method is exactly the same across the specific metadata table types. It might be worth moving ops, table, and name up to BaseMetadataTable. I did not do this because my change is small compared to the other quasi-duplicated code, and there might be other reasons I am not aware of that would prevent this refactor.
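
The SerializationUtil class mentioned in the first bullet can be thought of as a thin wrapper around standard Java object serialization. A minimal, self-contained sketch of what such a utility looks like; the class and method names below are illustrative and not necessarily the actual Iceberg API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class SerializationSketch {

    // Serialize any Serializable object to a byte array (done once, at query compilation).
    public static byte[] serializeToBytes(Object obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Rebuild the object on the executor side from the bytes alone.
    @SuppressWarnings("unchecked")
    public static <T> T deserializeFromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (T) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "table-state";
        byte[] bytes = serializeToBytes(original);
        String copy = deserializeFromBytes(bytes);
        System.out.println(copy.equals(original)); // prints "true"
    }
}
```

The point of routing everything through byte arrays is that the compile node can ship the serialized table to executors as opaque bytes, and the executors need nothing but the bytes to rebuild it.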

@pvary
Contributor Author

pvary commented Jan 8, 2021

@marton-bod, @lcspinter: Could you please review?
Thanks,
Peter

Contributor

@lcspinter lcspinter left a comment


LGTM (non-binding)

Just a couple of observations.
We store the TableOperations, Table, and name redundantly in each child class of BaseMetadataTable. Is there any reason we cannot move them one level higher?

Should we consider introducing a new ancestor class (SerializableBaseTable) for BaseTable and BaseMetadataTable? In my opinion, it would improve the readability of the code.

@rdblue What do you think?

@pvary
Contributor Author

pvary commented Jan 8, 2021

@openinx, @aokolnychyi: After talking with @rdblue, he said you might be interested in reviewing this change.

Would you be so kind as to review?

Thanks,
Peter

@openinx
Member

openinx commented Jan 11, 2021

@pvary I skimmed this PR; it seems I need more background to understand this change. Let me look at the previously committed PRs.

@pvary
Contributor Author

pvary commented Jan 11, 2021

> @pvary I skimmed this PR; it seems I need more background to understand this change. Let me look at the previously committed PRs.

Thanks @openinx for taking the time to check the PR!
Feel free to ask any questions here, on Slack, or by email if that is easier than digging everything up; I would be happy to answer them!

I would like to give some context, which I hope helps:
With Hive, and maybe other execution engines too, query compilation and query execution happen on different nodes, and only serialized data is sent between them. The execution can also happen in a distributed mode, and it is unnecessary (and even problematic) for every executor node to look up the table data from the catalog. If we read the table data from the catalog during compilation and then serialize it, the executor nodes do not need access to the catalog; it is enough for them to have S3 access to read the snapshot data themselves.

In a nutshell, what we are trying to achieve here is a way to serialize/deserialize not only BaseTables, but every metadata table as well.

Thanks,
Peter
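
The compile-then-distribute flow described above maps naturally onto Java's writeReplace()/readResolve() serialization hooks: on write, the table is swapped for a small serializable proxy holding only the state needed to rebuild it, so no live catalog connection ever crosses the wire. A minimal, self-contained sketch; FakeMetadataTable and TableProxy are illustrative names, not the actual Iceberg classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class WriteReplaceDemo {

    // Hypothetical stand-in for a metadata table: only the state needed to
    // rebuild it (here a metadata file location and a name) survives
    // serialization; nothing catalog-specific is carried across.
    public static class FakeMetadataTable implements Serializable {
        public final String metadataLocation;
        public final String name;

        public FakeMetadataTable(String metadataLocation, String name) {
            this.metadataLocation = metadataLocation;
            this.name = name;
        }

        // Java serialization hook: the proxy below is written instead of this object.
        public Object writeReplace() {
            return new TableProxy(metadataLocation, name);
        }
    }

    // Small serializable proxy, analogous to the pattern this PR adds.
    public static class TableProxy implements Serializable {
        public final String metadataLocation;
        public final String name;

        public TableProxy(String metadataLocation, String name) {
            this.metadataLocation = metadataLocation;
            this.name = name;
        }

        // Java deserialization hook: rebuild the table from the stored state.
        public Object readResolve() {
            return new FakeMetadataTable(metadataLocation, name);
        }
    }

    public static Object roundTrip(Object obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        FakeMetadataTable table = new FakeMetadataTable("s3://bucket/metadata/v3.json", "db.tbl.files");
        FakeMetadataTable copy = (FakeMetadataTable) roundTrip(table);
        System.out.println(copy.name); // prints "db.tbl.files"
    }
}
```

Because the proxy pins a concrete metadata location, every executor that deserializes the table sees the same snapshot, which is exactly the consistency property the PR description asks for.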

@@ -186,4 +194,34 @@ public Transaction newTransaction() {
public String toString() {
return name();
}

abstract Object writeReplace();
Contributor


This object has access to table(), io(), and name() already. Would it be easier to expose protected metadataLocation() and metadataTableType() methods instead of writeReplace() in each implementation? Then this method could be implemented here.
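
The refactoring suggested here can be sketched roughly as follows: the abstract base implements writeReplace() once, and each concrete metadata table only supplies the values that differ. All class names below are illustrative stand-ins, not the actual Iceberg code:

```java
import java.io.Serializable;

public class BaseClassSketch {

    abstract static class BaseMetadataTable implements Serializable {
        // Each subclass exposes these instead of implementing its own writeReplace().
        protected abstract String metadataLocation();
        protected abstract String metadataTableType();
        protected abstract String name();

        // Implemented once: every metadata table serializes the same way.
        public final Object writeReplace() {
            return new TableProxy(metadataLocation(), metadataTableType(), name());
        }
    }

    public static class TableProxy implements Serializable {
        public final String metadataLocation;
        public final String tableType;
        public final String name;

        TableProxy(String metadataLocation, String tableType, String name) {
            this.metadataLocation = metadataLocation;
            this.tableType = tableType;
            this.name = name;
        }
    }

    // Example concrete table: only the three accessors are needed.
    public static class FilesTable extends BaseMetadataTable {
        protected String metadataLocation() { return "s3://bucket/metadata/v3.json"; }
        protected String metadataTableType() { return "FILES"; }
        protected String name() { return "db.tbl.files"; }
    }

    public static void main(String[] args) {
        // Calling writeReplace() directly just shows the proxy that the
        // serialization machinery would write in place of the table.
        TableProxy proxy = (TableProxy) new FilesTable().writeReplace();
        System.out.println(proxy.tableType); // prints "FILES"
    }
}
```

The design benefit is that the per-type writeReplace() duplication the PR description mentions disappears: adding a new metadata table means implementing three accessors, not re-implementing the serialization logic.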

Contributor Author


Followed your recommendation. LGTM

The thing is that metadataLocation() is exactly the same for every implementation, but we need it if we do not want to relax the private restriction on ops. We might simplify the code further if we move ops to BaseMetadataTable.

What do you think? Or would it be a bigger change that is not worth doing?

Contributor


We could make ops a protected method. Let's get this in and we can clean that up later.

@pvary pvary force-pushed the serialize branch 2 times, most recently from dc21ead to 8088631, on January 21, 2021 11:52
@rdblue rdblue merged commit 9108ef4 into apache:master Jan 22, 2021
@pvary pvary deleted the serialize branch January 27, 2021 14:28
XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
4 participants