
Core: Make Metadata tables serializable #2046

Merged: 2 commits from the serialize branch into apache:master on Jan 22, 2021

Conversation

pvary
Contributor

@pvary pvary commented Jan 7, 2021

When used from Hive queries, it would be useful to serialize the tables at query compilation time, for multiple reasons:

  • If we use the same snapshot during the query execution, we get consistent results
  • If we do not have to access the catalog during the query execution, we can save HMS calls

Serialization was implemented for BaseTable in #1920. This PR aims to do the same for the metadata tables.

Things that might be worth checking:

  • Moved the SerializationUtil class to the core package. It is currently needed only for the tests, but I thought this would be OK.
  • Every writeReplace() method is exactly the same across the specific metadata table types. It might be worth moving ops, table, and name up to BaseMetadataTable. I did not do this because my change is small compared to the other quasi-duplicated code, and there might be other reasons I am not aware of that would prevent this refactor.
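
The SerializationUtil class mentioned in the first bullet can be thought of as a thin wrapper around standard Java object serialization. A minimal, self-contained sketch of what such a utility looks like; the class and method names below are illustrative and not necessarily the actual Iceberg API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class SerializationSketch {

    // Serialize any Serializable object to a byte array (done once, at query compilation).
    public static byte[] serializeToBytes(Object obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Rebuild the object on the executor side from the bytes alone.
    @SuppressWarnings("unchecked")
    public static <T> T deserializeFromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (T) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "table-state";
        byte[] bytes = serializeToBytes(original);
        String copy = deserializeFromBytes(bytes);
        System.out.println(copy.equals(original)); // prints "true"
    }
}
```

The point of routing everything through byte arrays is that the compile node can ship the serialized table to executors as opaque bytes, and the executors need nothing but the bytes to rebuild it.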

@pvary
Contributor Author

pvary commented Jan 8, 2021

@marton-bod, @lcspinter: Could you please review?
Thanks,
Peter

Contributor

@lcspinter lcspinter left a comment


LGTM (non-binding)

Just a couple of observations.
We store the TableOperations, Table, and name redundantly in each child class of BaseMetadataTable. Is there any reason we cannot move them one level higher?

Should we consider introducing a new ancestor class (SerializableBaseTable) for BaseTable and BaseMetadataTable? In my opinion, it would improve the readability of the code.

@rdblue What do you think?

@pvary
Contributor Author

pvary commented Jan 8, 2021

@openinx, @aokolnychyi: After talking with @rdblue, he said you might be interested in reviewing this change.

Would you be so kind as to review?

Thanks,
Peter

@openinx
Member

openinx commented Jan 11, 2021

@pvary I skimmed this PR; it seems I need more background to understand this change. Let me look at the previously committed PRs.

@pvary
Contributor Author

pvary commented Jan 11, 2021

> @pvary I skimmed this PR; it seems I need more background to understand this change. Let me look at the previously committed PRs.

Thanks @openinx for taking the time to check the PR!
Feel free to ask any questions here, on Slack, or by email if that is easier than digging everything up; I would be happy to answer them!

I would like to give some context, which I hope helps:
With Hive, and maybe other execution engines too, query compilation and query execution happen on different nodes, and only serialized data is sent between them. The execution can also happen in a distributed mode, and it is unnecessary (and even problematic) for every executor node to look up the table data from the catalog. If we read the table data from the catalog during compilation and then serialize it, the executor nodes do not need access to the catalog; it is enough for them to have S3 access to read the snapshot data themselves.

In a nutshell, what we are trying to achieve here is a way to serialize/deserialize not only BaseTables, but every metadata table as well.

Thanks,
Peter
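
The compile-then-distribute flow described above maps naturally onto Java's writeReplace()/readResolve() serialization hooks: on write, the table is swapped for a small serializable proxy holding only the state needed to rebuild it, so no live catalog connection ever crosses the wire. A minimal, self-contained sketch; FakeMetadataTable and TableProxy are illustrative names, not the actual Iceberg classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class WriteReplaceDemo {

    // Hypothetical stand-in for a metadata table: only the state needed to
    // rebuild it (here a metadata file location and a name) survives
    // serialization; nothing catalog-specific is carried across.
    public static class FakeMetadataTable implements Serializable {
        public final String metadataLocation;
        public final String name;

        public FakeMetadataTable(String metadataLocation, String name) {
            this.metadataLocation = metadataLocation;
            this.name = name;
        }

        // Java serialization hook: the proxy below is written instead of this object.
        public Object writeReplace() {
            return new TableProxy(metadataLocation, name);
        }
    }

    // Small serializable proxy, analogous to the pattern this PR adds.
    public static class TableProxy implements Serializable {
        public final String metadataLocation;
        public final String name;

        public TableProxy(String metadataLocation, String name) {
            this.metadataLocation = metadataLocation;
            this.name = name;
        }

        // Java deserialization hook: rebuild the table from the stored state.
        public Object readResolve() {
            return new FakeMetadataTable(metadataLocation, name);
        }
    }

    public static Object roundTrip(Object obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        FakeMetadataTable table = new FakeMetadataTable("s3://bucket/metadata/v3.json", "db.tbl.files");
        FakeMetadataTable copy = (FakeMetadataTable) roundTrip(table);
        System.out.println(copy.name); // prints "db.tbl.files"
    }
}
```

Because the proxy pins a concrete metadata location, every executor that deserializes the table sees the same snapshot, which is exactly the consistency property the PR description asks for.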

@@ -186,4 +194,34 @@ public Transaction newTransaction() {
public String toString() {
return name();
}

abstract Object writeReplace();
Contributor


This object has access to table(), io(), and name() already. Would it be easier to expose protected metadataLocation() and metadataTableType() methods instead of writeReplace() in each implementation? Then this method could be implemented here.
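
The refactoring suggested here can be sketched roughly as follows: the abstract base implements writeReplace() once, and each concrete metadata table only supplies the values that differ. All class names below are illustrative stand-ins, not the actual Iceberg code:

```java
import java.io.Serializable;

public class BaseClassSketch {

    abstract static class BaseMetadataTable implements Serializable {
        // Each subclass exposes these instead of implementing its own writeReplace().
        protected abstract String metadataLocation();
        protected abstract String metadataTableType();
        protected abstract String name();

        // Implemented once: every metadata table serializes the same way.
        public final Object writeReplace() {
            return new TableProxy(metadataLocation(), metadataTableType(), name());
        }
    }

    public static class TableProxy implements Serializable {
        public final String metadataLocation;
        public final String tableType;
        public final String name;

        TableProxy(String metadataLocation, String tableType, String name) {
            this.metadataLocation = metadataLocation;
            this.tableType = tableType;
            this.name = name;
        }
    }

    // Example concrete table: only the three accessors are needed.
    public static class FilesTable extends BaseMetadataTable {
        protected String metadataLocation() { return "s3://bucket/metadata/v3.json"; }
        protected String metadataTableType() { return "FILES"; }
        protected String name() { return "db.tbl.files"; }
    }

    public static void main(String[] args) {
        // Calling writeReplace() directly just shows the proxy that the
        // serialization machinery would write in place of the table.
        TableProxy proxy = (TableProxy) new FilesTable().writeReplace();
        System.out.println(proxy.tableType); // prints "FILES"
    }
}
```

The design benefit is that the per-type writeReplace() duplication the PR description mentions disappears: adding a new metadata table means implementing three accessors, not re-implementing the serialization logic.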

Contributor Author


Followed your recommendation. LGTM

The thing is that metadataLocation() is exactly the same for every implementation, but we need it if we do not want to relax the private restriction on ops. We might simplify the code further if we move ops to BaseMetadataTable.

What do you think? Or would it be a bigger change that is not worth doing?

Contributor


We could make ops a protected method. Let's get this in and we can clean that up later.

@pvary pvary force-pushed the serialize branch 2 times, most recently from dc21ead to 8088631, on January 21, 2021 11:52
@rdblue rdblue merged commit 9108ef4 into apache:master Jan 22, 2021
@pvary pvary deleted the serialize branch January 27, 2021 14:28
XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
4 participants