PARQUET-686: Add Order to store the order used for min/max stats. #46
Conversation
@isnotinvain and @julienledem, could you have a look?
src/main/thrift/parquet.thrift
Outdated
CUSTOM3 = 4;
CUSTOM4 = 5;
CUSTOM5 = 6;
CUSTOM6 = 7;
I'm not a big fan of this mechanism for extending ordering.
How about we define Collation as a string defined by the Unicode standard, as in your link below:
See http://userguide.icu-project.org/collation/api#TOC-Instantiating-the-Predefined-Collators
And we define "BINARY" as the unsigned order and "SIGNED_BINARY" as the signed one.
or possibly have more of a union style struct:
enum CollationStandard {
BINARY, // only "SIGNED" and "UNSIGNED" are accepted
UNICODE; // must provide locale
}
struct Collation {
1: required CollationStandard std;
2: required string name;
}
I like this union style struct approach as well.
The problem this enum is trying to solve is to avoid writing collation identifier strings in the metadata of each page. The approach I took was to keep that field small with an enum and translate the custom symbols to collating sequence identifier strings using metadata in the footer.
If we used the union-style struct approach, then "signed", "unsigned", or the identifier string would be present in every page. At a minimum, I think signed and unsigned should be symbols and only custom strings would be in every page.
I don't think the sort order should be allowed to change per page. So it should not be in the page metadata but in the footer per column. For example in ColumnMetaData or in SchemaElement.
I would tend to pick SchemaElement.
As a separate note, if it is an id to lookup in a table then we should make it an integer instead of an enum.
What if we just require the collation identifier string to be present when users are not using signed binary / unsigned binary?
Approach for the most part looks good to me. Added some minor comments.
src/main/thrift/parquet.thrift
Outdated
 * If this field is missing, the order must be the signed ordering for
 * backward-compatibility.
 */
5: optional Order order;
wondering if there would be a scenario where users want to store multiple ordering stats? (e.g. to support multiple kinds of queries?)
I don't think we will need them. Can you think of a specific use case for it?
Yeah I don't have a concrete use-case for it right now but I think it might be possible that users want to support multiple kinds of queries over their data (one with locale = x, one with locale = y)? Don't think we need to handle this right now though.
src/main/thrift/parquet.thrift
Outdated
/**
 * A string that identifies the order denoted by the place-holder order enum
 * symbol. Ordering descriptors should be well-known identifiers, such as the
should we be explicit about exactly what the identifiers should be? The comment as it stands now seems open to interpretation. If we explicitly say the identifiers should be only the locale keywords in ICU-54, that gives the various language implementations something concrete to code against.
At this point, we don't have an implementation that supports custom collating sequences, so I'm hesitant to be very prescriptive about identifying them. The implementation I envision is allowing callers to pass a Comparator and an identifier String, without the Parquet library controlling sorting. If that's something we can agree on, then this shouldn't be an area where we need to impose requirements.
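For illustration, a hypothetical Java sketch of that caller-supplied pairing of a Comparator and an identifier string; none of these names are real parquet-mr APIs, it only shows the shape of the idea:

import java.util.Comparator;

// Hypothetical sketch only: the application supplies a Comparator plus an
// identifier string; the writer would use the comparator to compute min/max
// and record only the identifier in the file metadata.
public final class CustomOrdering<T> {
  private final String identifier;        // written to the footer metadata
  private final Comparator<T> comparator; // used by the writer to pick min/max

  public CustomOrdering(String identifier, Comparator<T> comparator) {
    this.identifier = identifier;
    this.comparator = comparator;
  }

  public String identifier() { return identifier; }
  public Comparator<T> comparator() { return comparator; }
}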
src/main/thrift/parquet.thrift
Outdated
CUSTOM3 = 4;
CUSTOM4 = 5;
CUSTOM5 = 6;
CUSTOM6 = 7;
so if I understand right, we can't have more than 6 custom orders for a given parquet file right?
Yes.
Sorry for the delay, I didn't get a notification that there were comments. I'll look through and respond. Thanks for reviewing, everyone!
@julienledem, @piyushnarang, can you both take another look at this? I'd like to get this fixed so we can release it.
Changes for the most part look good to me. Also think an enum for CollationStandard would be nice.
src/main/thrift/parquet.thrift
Outdated
@@ -197,18 +197,78 @@ enum FieldRepetitionType {
  REPEATED = 2;
}

/** Identifier for the sort order used to produce min and max values. */
Wouldn't this also apply to the sorting_columns_ in the row groups?
I think sorting columns may need a separate update. These identifiers could potentially be used there, but I don't think the ordering (or direction) is currently stored.
Force-pushed from 5adc3dc to c3ef2e7.
I updated this in response to @julienledem's comments. I think he's right that we only need to store the order once for each column, so the new commit now stores the order per column in the file metadata instead.
Force-pushed from c3ef2e7 to f4a25da.
I've updated the last commit to include the following: …
I had initially added …
src/main/thrift/parquet.thrift
Outdated
0: required Order order;

/** Whether the order used is ascending (true) or descending (false) */
1: required boolean is_ascending;
SortingColumn already has this which defines the direction of the sort in the file:
4: optional list<SortingColumn> sorting_columns
In the case of statistics I don't think we should add an 'is_ascending' field in ColumnOrder as this basically just swaps min and max.
Works for me.
src/main/thrift/parquet.thrift
Outdated
 */
struct ColumnOrder {
  /** The order used for this column */
  0: required Order order;
in thrift we usually start field orderings at number 1 not 0. I don't remember why, but I see that we're doing that in the other structs so I think we should do it here too.
Fixed.
src/main/thrift/parquet.thrift
Outdated
@@ -576,5 +631,8 @@ struct FileMetaData {
 * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
 **/
6: optional string created_by

/** Sort order used for each column in this file */
7: optional list<ColumnOrder> column_orders;
How do you know from one ColumnOrder which column it applies to? Its position in this list has to match the position in the schema list above? If that's the case I think we should add a comment for that here. I guess this has to be here to avoid repeating it in each Statistics object?
There is one ColumnOrder for each column, in the same order as the column chunks (which is required to match the order in the schema).
Another option is to add an optional column_order field in SchemaElement, which would be set only for primitive types:
10: optional ColumnOrder column_order;
https://github.com/rdblue/parquet-format/blob/9962df8e0ea85858cf032451d0ee83ec3f4d39fe/src/main/thrift/parquet.thrift#L260
My reasoning for not adding this to the schema is that I don't think I'd add this to the string-based schema definition. I want to keep those roughly the same by not adding things to SchemaElement that aren't in the schema strings.
good point
src/main/thrift/parquet.thrift
Outdated
 * identify the actual order used for min, max, and soring values.
 *
 * This identifier should follow one of the following formats:
 *  * 'icu54:<locale-keyword>' - ICU 54 ordering for the ICU 54 locale keyword
what about something like:
struct ColumnOrder {
1: required OrderType orderType
2: optional string orderSubType
3: required boolean is_ascending
}
enum OrderType {
SIGNED, UNSIGNED, ICU_54, OTHER // more can be added as needed
}
and orderSubType isn't set for SIGNED/UNSIGNED?
Actually, if we really want to avoid the confusion of mixing specific orders and classes of orders, we can use a union instead:
struct ColumnOrder {
1: required Order order
2: required boolean is_ascending
}
struct Signed {}
struct Unsigned {}
struct ICU_54 {
1: required string locale_keyword
}
struct Other {
1: required string name
}
union Order {
1: Signed signed
2: Unsigned unsigned
3: ICU_54 icu_54
4: Other
// add more as needed
}
This will generate an Order struct that can only be one of {Signed, Unsigned, ICU_54(locale_keyword), Other(name)}. It avoids having to deal with the fact that Signed / Unsigned require no additional info, and it also means we can add extra info to each type of ordering as needed.
I've removed the other symbols. We can revisit this when we know more about the requirements of processing engines.
I think we still need to finalize this aspect of the discussion.
the signed/unsigned aspect is what's blocking impala/sparksql right now.
When they move to implement collation it will be easier to converge on the collation string spec.
Let's open a separate JIRA for that.
In that case, should we remove the commented out fields + their docs?
@rdblue It sounds like Union? +1 from me as well
I don't think we've confirmed that the union is forward-compatible. I haven't had time to check, so if either of you wants to that would be helpful.
Works for thrift 0.9.3. I don't have thrift 0.7 working on my laptop right now.
If you want to try with thrift 0.7:
apache/parquet-java#405
related: we should just update parquet-format to thrift 0.9.3
#50
@rdblue: works for thrift 0.7 per travis-ci: apache/parquet-java#405
This adds a list of ColumnOrder to FileMetaData. The list should include one ColumnOrder for each column, in the same order as the file's column chunks. Each ColumnOrder is a struct with an Order enum (signed, unsigned, or custom) and an optional string to describe custom orders.
Force-pushed from f4a25da to 4534062.
This looks good to me.
My last question is whether column_order should be in SchemaElement (see comment above) rather than a list in FileMetadata.
LogicalTypes.md
Outdated
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.

The sort order used for `DECIMAL` values must be `SIGNED`. The order is
"is" or "must be" seem superfluous. It seems like a logical conclusion, so I'd keep "is".
+1
src/main/thrift/parquet.thrift
Outdated
/** Descriptor for the order used for min, max, and sorting values in a column
 */
struct ColumnOrder {
  /** The order used for this column */
Should we add to the comment that this has to be ascending?
I don't think so. I think min and max are clear enough.
src/main/thrift/parquet.thrift
Outdated
/**
 * Identifiers for custom orderings, to be defined in the ColumnOrder struct.
 */
//CUSTOM = 2;
Timestamps are currently stored in INT96 by Impala and Hive, and with INT96 deprecation on the horizon, can we enable custom ordering for binary / fixed types in this change?
what would CUSTOM mean in that context?
Impala and Hive store timestamps as a tuple (int64_t, uint32_t), where the first element is the time of the day in nanoseconds, and the second is the day according to the Julian calendar. Ordering these means to sort them by the second element first (the day), then by the first (the time). None of the other orderings seem to reflect this, so it would be good if we could at least set it to CUSTOM to signal that any validation will fail unless the client knows how to interpret them.
Does it make more sense to document something like IMPALA_TIMESTAMP as its own logical type?
We definitely should change the situation, as the values currently stored as INT96 aren't really interpretable as integers; better to open a separate JIRA to untangle it a bit from the sorting discussion here.
@lekv I also tried to document the current state here: #49. It would be nice if you could also have a short look to check whether the formula is correct.
@xhochy - I left some comments in #49. I agree that it's better to untangle / document Impala timestamps in a separate JIRA. Should we use PARQUET-861 for that, or shall I go ahead and make a new one?
@julienledem - Do you think it's better to defer adding "CUSTOM" orderings until adding the first LogicalType that needs one?
I don't understand the claim that INT96 aren't interpretable as integers. They don't map to a common hardware integer type but they're still logically a twos-complement integer.
I think on the Impala side we may end up implementing min-max stats for INT96 ahead of doing a larger rework of the timestamp data type, so it would be good to have clarity about what other implementations should do when they encounter INT96 min-max stats.
My take is that they should either ignore INT96 stats (if they don't want to implement a deprecated type) or use signed ordering. Parquet-mr doesn't do this currently. Does everyone agree that https://issues.apache.org/jira/browse/PARQUET-840 is a bug in parquet-mr?
I guess I'm assuming that signed INT96 ordering is equivalent to the logical IMPALA_TIMESTAMP ordering. I think this is true because the least significant 8 bytes of the INT96 are the unsigned nanoseconds component and the most significant 4 bytes are the signed Julian day.
@timarmstrong if that's the case then we can just use SIGNED Ordering for them.
@lekv do you agree?
@timarmstrong I assumed we are writing (nanoseconds, day) tuples, and the UNSIGNED order would not be equivalent to the logical ordering. Internally Impala stores them as (time, date) and we don't reverse those. Hive seems to do it the same way as Impala: https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/example/data/simple/NanoTime.java#L58
I think it works out because INT96 is little-endian - the most significant 4 bytes of the INT96 line up with the 4 bytes of the date
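For illustration, a minimal Java sketch of the claimed equivalence, assuming the Impala/Hive layout described above (8 bytes nanoseconds-of-day followed by 4 bytes Julian day, both little-endian); the class and method names are illustrative, not part of any Parquet API:

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class Int96OrderDemo {
  // Build the 12-byte INT96 value: nanos-of-day (8 bytes) then Julian day
  // (4 bytes), both little-endian, matching the layout described in this thread.
  static byte[] int96(long nanosOfDay, int julianDay) {
    return ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay).putInt(julianDay).array();
  }

  // Interpret the little-endian bytes as one signed (two's complement) integer.
  static BigInteger asSignedLittleEndian(byte[] le) {
    byte[] be = new byte[le.length];
    for (int i = 0; i < le.length; i++) {
      be[i] = le[le.length - 1 - i]; // reverse to big-endian for BigInteger
    }
    return new BigInteger(be);
  }

  public static void main(String[] args) {
    byte[] earlierDay = int96(5_000L, 2_457_800);
    byte[] laterDay   = int96(1_000L, 2_457_801); // later day, smaller time-of-day
    // The later day compares greater even though its nanos value is smaller,
    // because the Julian day occupies the most significant 4 bytes.
    System.out.println(
        asSignedLittleEndian(laterDay).compareTo(asSignedLittleEndian(earlierDay)) > 0); // true
  }
}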
Looks good to me.
I think we moved the remaining discussion points to separate JIRAs, and this addresses the first use case of having the default UTF-8 ordering for strings and fixing the ordering of other unsigned logical types.
Follow-up in separate JIRAs:
- int96
- collations for string type
LogicalTypes.md
Outdated
@@ -55,6 +57,8 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.

The sort order used for signed integer types must be `SIGNED`.
So this means that writers are responsible for setting order to SIGNED when writing a column of this type, right? What is the expected behaviour if a reader encounters a logical type of signed integer but UNSIGNED order? Is this invalid according to the spec, or does the sort order setting override the default sort for the logical type?
My interpretation is we should fail.
This is redundant with the logical type, but more explicit, and it deals with backward compatibility.
Sounds reasonable to me.
Any implementation should ignore stats that it cannot use. I think Lars mentioned in the recent hangout that there are transforms that can be done in some cases to use stats that are written with some compatible orderings.
A simple example is where a UINT32 field has min=10 and max=1000. Since both are positive and not in the top half of the UINT32 range, we know that the same min/max are produced by both signed and unsigned comparison and the implementation can use it.
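For illustration, a minimal Java sketch of that compatibility check, assuming the UINT32 stats were computed with signed 32-bit comparison; the names are illustrative, not a Parquet API:

public final class UnsignedStatsCompat {
  // The stats were produced with signed int comparison. If both min and max are
  // non-negative as signed ints, no value in the column had its most significant
  // bit set, so signed and unsigned orderings agree and the same min/max would
  // have been produced by unsigned comparison.
  static boolean signedStatsUsableAsUnsigned(int signedMin, int signedMax) {
    return signedMin >= 0 && signedMax >= 0;
  }

  public static void main(String[] args) {
    System.out.println(signedStatsUsableAsUnsigned(10, 1000)); // true: safe to reuse
    System.out.println(signedStatsUsableAsUnsigned(10, -1));   // false: max is in the top half of the UINT32 range
  }
}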
LogicalTypes.md
Outdated
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `UTF8` strings must be `UNSIGNED` byte-wise comparison.
It might be helpful for implementers to note that this is equivalent to ordering by unicode code points, since that's a bit subtle. If you feel like that's overly verbose feel free to leave it out.
+1
Is this true? I thought the sort order for strings should be unsigned, from now on, but it's valid and possible for old files to have signed ordering. Should we clarify that both orderings are supported but one is preferred?
This is stating that all UTF8 from now on should be written with UNSIGNED comparison. It is still the case that files without the SortOrder list used SIGNED.
LogicalTypes.md
Outdated
The sort order used for `DECIMAL` values must be `SIGNED`. The order is
must be equivalent to signed comparison of decimal values.

If the column uses `int32` or `int64` physical types, then signed comparison of
Thanks for this clarification.
LogicalTypes.md
Outdated
If the column uses `int32` or `int64` physical types, then signed comparison of
the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
fixed_len_byte_array?
Yes. Fixed is used interchangeably with the full name elsewhere as well.
Gotcha, I'm not totally familiar with some of these conventions.
the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
most-significant bit in the first byte and then using unsigned byte-wise
comparison.
What about decimals encoded into variable-length binary?
There is an optimization, but it is too long to describe here. The main point is that the order must match the signed order produced by comparing the decimal values. This paragraph only gives optimizations for easy types.
Makes sense, thanks for explaining.
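For illustration, a minimal Java sketch of the sign-bit trick described in the quoted text for DECIMAL stored as FIXED_LEN_BYTE_ARRAY (big-endian two's complement values of equal length); the names are illustrative, not a Parquet API:

import java.util.Comparator;

public final class FixedDecimalOrder {
  // Signed comparison of equal-length big-endian two's complement values:
  // flip the most significant bit of the first byte, then compare all bytes
  // as unsigned values.
  static final Comparator<byte[]> SIGNED_TWOS_COMPLEMENT = (a, b) -> {
    for (int i = 0; i < a.length; i++) { // lengths are equal for a fixed-width column
      int ua = (a[i] & 0xFF) ^ (i == 0 ? 0x80 : 0);
      int ub = (b[i] & 0xFF) ^ (i == 0 ? 0x80 : 0);
      if (ua != ub) {
        return Integer.compare(ua, ub);
      }
    }
    return 0;
  };

  public static void main(String[] args) {
    byte[] minusOne = {(byte) 0xFF, (byte) 0xFF}; // -1 as a 2-byte value
    byte[] plusOne  = {(byte) 0x00, (byte) 0x01}; // +1
    System.out.println(SIGNED_TWOS_COMPLEMENT.compare(minusOne, plusOne) < 0); // true
  }
}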
@@ -144,6 +169,9 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.

## Embedded Types
It would be good to clarify what the expected sort order is for embedded types (or if we want to punt on that question, make it explicit that readers should ignore it).
In the latest commit, I clarified that there isn't a required ordering for embedded types.
src/main/thrift/parquet.thrift
Outdated
@@ -576,5 +628,8 @@ struct FileMetaData {
 * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
 **/
6: optional string created_by

/** Sort order used for each column in this file */
7: optional list<ColumnOrder> column_orders;
What's the plan for backwards/forwards compatibility?
It seems like a newer implementation can assume that stats were written in the old way if column_orders is unset (this might still be useful when reading old data with int or floating point stats), or in the new way if column_orders is set. But is there a plan to guarantee that old implementations will ignore the new stats?
Yes, if column_orders is missing that means they are written the old way.
In the current PR it looks like old versions of the library won't tell the difference between old and new stats. If we want to make sure they don't see the new stats, we need to put them in a different field in the Stats struct.
Do we want to do that?
+1 to making them a different field.
Sorry, I think this was clear from previous discussion, but not from the PR.
Readers must assume that the order is signed, unless this SortOrder is set, because that is what the Java library currently writes. The original min and max stats fields are still used because the majority of the values are valid. There are only a few types that are currently stored incorrectly:
- UTF8 strings
- Decimals (stored as fixed or bytes)
- Intervals
- UINTs
Unsigned ints and intervals aren't used in any engine that I know about. Decimals are used, but I doubt anyone is filtering based on the stats because there isn't a way to pass a decimal to the filter code path in Parquet Java. (You'd have to pass an encoded decimal Binary; same problem for intervals.)
That means that the only type with incorrect stats that is really a problem is UTF8. The reason why we didn't catch this problem sooner is because characters stored as a single byte without the msb set (including ASCII) have the same sort order using signed and unsigned comparison. This is why Parquet Java has a property to ignore the wrong sort order and use stats anyway.
Using a separate field for the new stats would mean that old readers can't use stats for UTF8 fields, even though there are lots of cases where it is fine, and it is not really worse than what they do already -- behavior no one noticed was wrong for a year. It would also require more format changes and could prevent older readers from using stats for signed fields that are correct.
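For illustration, a minimal Java sketch of why ASCII-only data hides the problem: signed and unsigned byte-wise comparison only disagree when a byte has its most significant bit set, which never happens for ASCII but always does for multi-byte UTF-8 sequences. Names are illustrative, not part of parquet-mr:

import java.nio.charset.StandardCharsets;

public final class ByteOrderDemo {
  // Lexicographic byte-wise comparison, treating each byte as signed or unsigned.
  static int compare(byte[] a, byte[] b, boolean unsigned) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int x = unsigned ? (a[i] & 0xFF) : a[i];
      int y = unsigned ? (b[i] & 0xFF) : b[i];
      if (x != y) {
        return Integer.compare(x, y);
      }
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    byte[] abc = "abc".getBytes(StandardCharsets.UTF_8);
    byte[] abd = "abd".getBytes(StandardCharsets.UTF_8);
    byte[] accented = "ábc".getBytes(StandardCharsets.UTF_8); // first byte 0xC3 has the msb set
    // ASCII-only data: both orderings agree.
    System.out.println(compare(abc, abd, false) < 0); // true
    System.out.println(compare(abc, abd, true) < 0);  // true
    // Non-ASCII data: signed comparison sorts 'á' before 'a', unsigned sorts it after.
    System.out.println(compare(accented, abc, false) < 0); // true  (signed, not code point order)
    System.out.println(compare(accented, abc, true) < 0);  // false (unsigned matches code point order)
  }
}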
thanks for the review @timarmstrong.
LGTM
any last comments?
I agree it would be useful if the stats were usable by older implementations when they're compatible.
Is there any way that a new implementation can write out new-style stats and force older implementations to ignore them? E.g. if I want to use some special unicode collation in my query engine, but also want to read the files via an older parquet-mr, which assumes all stats are the old style.
@timarmstrong, even after this PR, there's no way to record that you're using a special unicode collation, so it isn't allowed. If we included the custom order identifier, then readers that know to only use stats when they support the order are fine, because they can ignore them. Older readers would not produce correct results.
This raises a good point about format compatibility. I think it is okay to continue using the current min and max fields because the result is no more broken than it is already. But other collating sequences are forward-incompatible changes to the format if they use the same fields. Also, the proposals to add symbols to the …
That's enough to convince me that you guys were right all along and we should use new min and max fields. We can also store min and max in the old fields for older readers when compatible, but new stats that may have a different ordering shouldn't be available to older readers. I'll add a commit that does the following: …
@isnotinvain and @julienledem, you'll probably want to take another look after the new commit.
This avoids forward-compatibility problems.
Adding new fields is fine by me, although we could rename min and max to min_signed and max_signed for clarity. Thrift allows renaming fields.
@julienledem, I updated this to use a union. One last review?
Union changes LGTM, thanks! +1
Thanks for the quick review, @isnotinvain! I'll merge early next week after everyone has had a chance to weigh in on the changes.
Are we ready to merge this one?
I think it's ready, but I thought @julienledem may want to have another look.
I just tried out the parquet.thrift file by copying it into the Impala sources and trying to compile, and gcc choked on ConvertedType.NULL.
Has anyone else tried to compile this with g++? We're currently using gcc-4.9.2 and thrift-0.9.0. I'll investigate further, but I also thought someone may have seen this before.
@rdblue - I missed that it wasn't actually introduced in this commit. Apologies for the noise. I also noticed that the current layout of the comments in …
Do we usually try to format the Thrift code in a way that the comments look nice in Python?
+1 LGTM
Thanks!
 * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
 * as fixed or bytes should be ignored.
 */
7: optional list<ColumnOrder> column_orders;
Do you want to add a note that there is exactly one ColumnOrder per column as defined by the schema (1 column per leaf node in the schema)?
src/main/thrift/parquet.thrift
Outdated
union ColumnOrder {
  1: Signed SIGNED;
  2: Unsigned UNSIGNED;
}
sorry for getting back at this. (I blame the jet lag for answering too fast on my previous comment)
Actually I'd suggest the following (to have a more union-like definition and less of an enum-as-union):
/**
* sorted according to the physical type (considered signed or not)
* binary is in lexicographic order (assuming each byte is signed or unsigned)
* numerical types are sorted according to their natural order (signed or unsigned)
*/
struct PhysicalTypeColumnOrder {
1: required boolean signed;
}
/** Union containing the order used for min, max, and sorting values in a column */
union ColumnOrder {
1: PhysicalTypeColumnOrder physical;
}
and naturally in the future we would add the following as a second type in the union:
/** the data is sorted according to the provided collation */
struct CollatedColumnOrder {
1: required string collation_string;
}
 * Sort order used for each column in this file.
 *
 * If this list is not present, then the order for each column is assumed to
 * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
I am not sure I understand the second sentence. Why would we need to ignore decimal min/max values if the order is not present? Why does it say "should"?
If the order isn't present, then the stats were written with byte-wise signed comparison, which is useless for those types. Readers should ignore the values.
Force-pushed from b812eca to f878c34.
@julienledem, I updated this. One more review?
LGTM
I'll give others a day or two to weigh in and then commit this.
Merged. Thanks for all of the discussion and reviews, everyone!
PR apache#46 introduced ColumnOrder with the limitation that a reader should ignore stats for a column if the corresponding ColumnOrder in FileMetaData contains an unknown value. This change adds a special column order, InvalidOrder, that can be used in tests and should not be used otherwise. This change also fixes the paths in the Makefile.
This adds a new enum, Order, that will be set to the order used to produce the min and max values in all Statistics objects (at the page level). Order has 8 symbols: SIGNED, UNSIGNED, and 6 symbols for custom orderings. This also adds a CustomOrder struct that is used to map the custom order symbols to string descriptors, such as order keywords used by ICU collating sequences. CustomOrder mappings are stored in the file footer.