-
Notifications
You must be signed in to change notification settings - Fork 454
GH-502: Clarify Int96 status and add recommended ordering #504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * INT96 (only used for legacy timestamps) - undefined(+) | ||
| * FLOAT - signed comparison of the represented value (*) | ||
| * DOUBLE - signed comparison of the represented value (*) | ||
| * BYTE_ARRAY - unsigned byte-wise comparison | ||
| * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison | ||
| * | ||
| * (+) While the INT96 type has been deprecated, at the time of writing it is | ||
| * still used in many legacy systems. If a Parquet implementation chooses | ||
| * to write statistics for INT96 columns, it is recommended to order them | ||
| * according to the legacy rules: | ||
| * - compare the last 4 bytes (days) as a little-endian 32-bit signed integer | ||
| * - if equal last 4 bytes, compare the first 8 bytes as a little-endian | ||
| * 64-bit signed integer (nanos) | ||
| * See https://github.com/apache/parquet-format/issues/502 for more details | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb how about wording this more strongly?
| * INT96 (only used for legacy timestamps) - undefined(+) | |
| * FLOAT - signed comparison of the represented value (*) | |
| * DOUBLE - signed comparison of the represented value (*) | |
| * BYTE_ARRAY - unsigned byte-wise comparison | |
| * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison | |
| * | |
| * (+) While the INT96 type has been deprecated, at the time of writing it is | |
| * still used in many legacy systems. If a Parquet implementation chooses | |
| * to write statistics for INT96 columns, it is recommended to order them | |
| * according to the legacy rules: | |
| * - compare the last 4 bytes (days) as a little-endian 32-bit signed integer | |
| * - if equal last 4 bytes, compare the first 8 bytes as a little-endian | |
| * 64-bit signed integer (nanos) | |
| * See https://github.com/apache/parquet-format/issues/502 for more details | |
| * | |
| * INT96 (only used for legacy timestamps) - timestamp logical comparison (+) | |
| * FLOAT - signed comparison of the represented value (*) | |
| * DOUBLE - signed comparison of the represented value (*) | |
| * BYTE_ARRAY - unsigned byte-wise comparison | |
| * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison | |
| * | |
| * (+) While the INT96 type has been deprecated, at the time of writing it is | |
| * still used in many legacy systems. The only logical type stored in INT96 | |
| * is a timestamp defined as 8 bytes signed little-endian nanos followed by | |
| * 4 bytes signed little-endian julian days. | |
| * See https://github.com/apache/parquet-format/issues/502 for more details | |
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually prefer the original wording as it a) leaves the ordering undefined (and thus isn't really a change to the specification) and b) spells out exactly how to perform the in-the-wild ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @etseidl's suggestion. BTW, I think we need to avoid saying only logical type stored in INT96 since literally it is not a Parquet logical type or (deprecated) converted type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@etseidl I agree that leaving it undefined is an advantage in not changing the spec. The disadvantage is that readers will have less confidence on the meaning of this field when it exists.
@wgtmac I see INT96 as a weird physical type. Had it been a real physical type it would have a logical type TIMESTAMP that would signify that top 64 bits are nanos and bottom 32 bits are julian days. It could also have a logical type IntType{96, true/false} which could be used to store numbers greater than int64max. In the former case the ordering would be what we propose today and in the latter case the ordering would be signed comparison of the 96 bit signed or unsigned integer. The reality is that there is no logical type ever defined for INT96 and its logical type is defacto the timestamp representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @etseidl is referring to my original proposed change (not the wording prior to this PR)
@alamb how about wording this more strongly?
In terms of changing the wording from
- * INT96 (only used for legacy timestamps) - undefined(+)
+ * INT96 (only used for legacy timestamps) - timestamp logical comparison (+)I would personally read this as a change in the spec, which obligated writers to use the new ordering. I was under the impression we didn't have consensus on making that change
I prefer leaving the wording as undefined personally.
The disadvantage is that readers will have less confidence on the meaning of this field when it exists.
This might be a good thing given that we have instances where old writers wrote it incorrectly may not be correct.
|
@wgtmac and @mkornacker can you review and merge this PR if appropriate? It appears to have been approved / agreed by all interested parties, and does not actually change the spec, but documents some existing behavior to improve ecosystem compatibility |
|
@alamb Sorry for a late reply. I just checked the discussion on the dev ML but it seems no consensus was reached, isn't it? cc @emkornfield |
I reviewed this mailing list thread: https://lists.apache.org/thread/tjsz7fo8vlmgtdrwjthtok4v8jf2ntx0 My understanding is that there is consensus that we did not want to change the spec However, I also think there is support for clarification of the situation. Since (in my opinion) this PR does not change the spec, only clarifies what other writers (currently) do, it is an improvement in general |
|
Just went through the change again and I agree with you. Let me merge this. Thanks @alamb! |
|
Thanks for merging this @wgtmac and for everyones help |
Rationale for this change
What changes are included in this PR?
Do these changes have PoC implementations?
No -- in my mind this does not change the meaning or intent of the parquet spec, but instead adds clarification to help other implementers