-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: add sort order id to content file #1975
Conversation
The field should be optional because v1 manifests will not have the order field. Iceberg will read the value as null, so I think it makes sense to use null. And you're right about not storing it for position deletes.
We want the ID. Sort orders are attached to table metadata, so loading the order should be a simple hash map lookup.
No. Engines must specify which sort order was used to write a file explicitly. So this needs to be exposed in the DataFile and DeleteFile builders. By default, we should write either null or 0 (unordered). Probably null. |
@@ -268,6 +271,11 @@ public Builder withEncryptionKeyMetadata(EncryptionKeyMetadata newKeyMetadata) { | |||
return withEncryptionKeyMetadata(newKeyMetadata.buffer()); | |||
} | |||
|
|||
public Builder withSortOrderId(int newSortOrderId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about withSortOrder
? Is there a case where we would only have the ID and not the SortOrder
? Just trying to think of what will be convenient. We could always expose both methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we change this to withSortOrder
and store sortOrder
directly, copy()
of this Builder might have problem since DataFile
only has sortOrderId()
but no sortOrder()
so we couldn't copy directly; we probably also couldn't add sortOrder()
to ContentFile
/DataFile
since the file created from Avro reflection will not have the struct without looking up table metadata?
But if withSortOrder
essentially only extracts the id and does the same thing as withSortOrderId
here, then it will definitely be more convenient. I'll update to do that but please let me know if I didn't understand correctly!
|
||
int PARTITION_ID = 102; | ||
String PARTITION_NAME = "partition"; | ||
String PARTITION_DOC = "Partition data tuple, schema based on the partition spec"; | ||
// NEXT ID TO ASSIGN: 140 | ||
// NEXT ID TO ASSIGN: 141 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will also need to add this to the spec.
* when they share the same sort order id. | ||
*/ | ||
default Integer sortOrderId() { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this after equalityFieldIds
instead of here? The copy methods should be last because they aren't fields in the file.
* This information will be useful for merging data and equality delete files more efficiently | ||
* when they share the same sort order id. | ||
*/ | ||
default Integer sortOrderId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking to add it to ensure backward compatibility in case people implement their own ContentFile
, do you think it's unlikely so it's safe for us to not have default?
Thank you for the response!
I guess in order to do that, we may need to add the file's sort order map in
(Sorry for the naive question) I guess the sort order needs to be decided when building the writer (e.g. add a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, the only question I had was about returning null or 0, but Ryan's comment answered my question.
@@ -124,6 +124,14 @@ | |||
*/ | |||
List<Integer> equalityFieldIds(); | |||
|
|||
/** | |||
* Returns the sort order id of this file, which describes how the file is ordered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: <p>
after the line
@@ -141,4 +149,5 @@ | |||
* null value counts, or nan value counts | |||
*/ | |||
F copyWithoutStats(); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can you remove this whitespace change? It isn't needed and could cause conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, this is the only issue and it isn't worth blocking.
@@ -268,6 +271,13 @@ public Builder withEncryptionKeyMetadata(EncryptionKeyMetadata newKeyMetadata) { | |||
return withEncryptionKeyMetadata(newKeyMetadata.buffer()); | |||
} | |||
|
|||
public Builder withSortOrder(SortOrder newSortOrder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Looks great, thanks @yyanyy! |
This change adds sort_order_id to content file, and allow read and write of this attribute in manifest entry. The logic of populating sort order id in writers from table attribute is not included here and will be the main focus for the next PR, as doing so will likely requires a lot of signature changes.
Questions
unsorted_order
): decided as nullable for now, as I think positional delete files shouldn't have sort order (since they should be sorted by file_path and position per spec)