
PARQUET-686: Add Order to store the order used for min/max stats. #46

Closed
wants to merge 8 commits

Conversation

@rdblue (Contributor) commented Nov 13, 2016

This adds a new enum, Order, that will be set to the order used to produce the min and max values in all Statistics objects (at the page level). Order has 8 symbols: SIGNED, UNSIGNED, and 6 symbols for custom orderings. This also adds a CustomOrder struct that is used to map the custom order symbols to string descriptors, such as order keywords used by ICU collating sequences. CustomOrder mappings are stored in the file footer.
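A rough Thrift sketch of what this description proposes (names follow the text above; field ids are illustrative, not the exact diff):

enum Order {
  SIGNED = 0;    // signed comparison of the physical values
  UNSIGNED = 1;  // unsigned, byte-wise comparison
  CUSTOM1 = 2;
  CUSTOM2 = 3;
  CUSTOM3 = 4;
  CUSTOM4 = 5;
  CUSTOM5 = 6;
  CUSTOM6 = 7;
}

/** Maps a CUSTOMn symbol to a string descriptor stored once in the footer,
 *  e.g. an ICU collation keyword. */
struct CustomOrder {
  1: required Order order;       // one of the CUSTOMn symbols
  2: required string descriptor; // well-known identifier for the collating sequence
}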

@rdblue (Contributor Author) commented Nov 14, 2016

@isnotinvain and @julienledem, could you have a look?

CUSTOM3 = 4;
CUSTOM4 = 5;
CUSTOM5 = 6;
CUSTOM6 = 7;
Member:

I'm not a big fan of this mechanism for extending ordering.
How about we define Collation as a string defined by the Unicode standard, as in your link below:
See http://userguide.icu-project.org/collation/api#TOC-Instantiating-the-Predefined-Collators
And we define "BINARY" as the unsigned order and "SIGNED_BINARY" as the signed one.
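A minimal sketch of that suggestion (hypothetical, for illustration only):

/** One collation identifier per column: "BINARY" for the unsigned byte-wise order,
 *  "SIGNED_BINARY" for the signed order, anything else an ICU collation keyword. */
struct Collation {
  1: required string name;
}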

Member:

Or possibly a more union-style struct:

enum CollationStandard {
  BINARY, // only "SIGNED" and "UNSIGNED" are accepted
  UNICODE; // must provide locale
}
struct Collation {
  1: required CollationStandard std;
  2: required string name;
}

Comment:

I like this union style struct approach as well.

Contributor Author:

The problem this enum is trying to solve is to avoid writing collation identifier strings in the metadata of each page. The approach I took was to keep that field small with an enum and translate the custom symbols to collating sequence identifier strings using metadata in the footer.

If we used the union-style struct approach, then "signed", "unsigned", or the identifier string would be present in every page. At a minimum, I think signed and unsigned should be symbols and only custom strings would be in every page.

Member:

I don't think the sort order should be allowed to change per page. So it should not be in the page metadata but in the footer per column. For example in ColumnMetaData or in SchemaElement.
I would tend to pick SchemaElement.
As a separate note, if it is an id to look up in a table, then we should make it an integer instead of an enum.

Comment:

What if we just require the collation identifier string to be present when users are not using signed binary / unsigned binary?

@piyushnarang left a comment

Approach for the most part looks good to me. Added some minor comments.

* If this field is missing, the order must be the signed ordering for
* backward-compatibility.
*/
5: optional Order order;

Comment:

wondering if there would be a scenario where users want to store multiple ordering stats? (e.g. to support multiple kinds of queries?)

Contributor Author:

I don't think we will need them. Can you think of a specific use case for it?

Comment:

Yeah I don't have a concrete use-case for it right now but I think it might be possible that users want to support multiple kinds of queries over their data (one with locale = x, one with locale = y)? Don't think we need to handle this right now though.


/**
* A string that identifies the order denoted by the place-holder order enum
* symbol. Ordering descriptors should be well-known identifiers, such as the

Comment:

Should we be explicit about exactly what the identifiers should be? The comment as it stands now seems open to interpretation. If we explicitly say the identifiers should be only the locale keywords in ICU-54, that allows various language implementations to code against that.

Contributor Author:

At this point, we don't have an implementation that supports custom collating sequences, so I'm hesitant to be very prescriptive about identifying them. The implementation I envision is allowing callers to pass a Comparator and an identifier String, without the Parquet library controlling sorting. If that's something we can agree on, then this shouldn't be an area where we need to impose requirements.

CUSTOM3 = 4;
CUSTOM4 = 5;
CUSTOM5 = 6;
CUSTOM6 = 7;

Comment:

So if I understand right, we can't have more than 6 custom orders for a given Parquet file, right?

Contributor Author:

Yes.

@rdblue (Contributor Author) commented Dec 7, 2016

Sorry for the delay, I didn't get a notification that there were comments. I'll look through and respond. Thanks for reviewing, everyone!

@rdblue (Contributor Author) commented Jan 6, 2017

@julienledem, @piyushnarang, can you both take another look at this? I'd like to get this fixed so we can release it.

@piyushnarang left a comment

Changes for the most part look good to me. Also think an enum for CollationStandard would be nice.


@@ -197,18 +197,78 @@ enum FieldRepetitionType {
REPEATED = 2;
}

/** Identifier for the sort order used to produce min and max values. */
Contributor:

Wouldn't this also apply to the sorting_columns in the row groups?

Contributor Author:

I think sorting columns may need a separate update. These identifiers could potentially be used there, but I don't think the ordering (or direction) is currently stored.

@rdblue (Contributor Author) commented Feb 6, 2017

I updated this in response to @julienledem's comments. I think he's right that we only need to store the order once for each column, so this now adds list<ColumnOrder> to the footer. The Statistics objects are unchanged because there's no need to add the order to each stats object in the format.

In the new commit, ColumnOrder is a struct with 2 fields: the first is order, an Order enum with SIGNED, UNSIGNED, and CUSTOM; the second is custom_order, a string that describes the order used when order is CUSTOM.
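Roughly, the described struct looks like this (a sketch; field ids are illustrative and the actual diff is quoted below):

enum Order {
  SIGNED = 0;
  UNSIGNED = 1;
  CUSTOM = 2;
}

/** One entry per column, stored as list<ColumnOrder> in FileMetaData. */
struct ColumnOrder {
  1: required Order order;
  2: optional string custom_order; // set only when order is CUSTOM
}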

@rdblue force-pushed the PARQUET-686-add-stats-ordering branch from c3ef2e7 to f4a25da on February 6, 2017
@rdblue (Contributor Author) commented Feb 6, 2017

I've updated the last commit to include the following:

  • Added ColumnOrder.is_ascending boolean
  • Added a specific format for the custom_order string

I had initially added ICU54 to the Order enum, but it seems confusing to have Order include both specific orderings and classes of orderings, so I went with the string prefix option. We can change it if anyone is strongly opposed.

0: required Order order;

/** Whether the order used is ascending (true) or descending (false) */
1: required boolean is_ascending;
Member:

SortingColumn already has this, which defines the direction of the sort in the file:
4: optional list<SortingColumn> sorting_columns
In the case of statistics I don't think we should add an 'is_ascending' field in ColumnOrder as this basically just swaps min and max.
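For reference, SortingColumn in parquet.thrift already looks roughly like this (per-column sort direction at the row-group level):

struct SortingColumn {
  /** The column index (in this row group) */
  1: required i32 column_idx;

  /** If true, this column is sorted in descending order. */
  2: required bool descending;

  /** If true, nulls come before non-null values; otherwise nulls go at the end. */
  3: required bool nulls_first;
}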

Contributor Author:

Works for me.

*/
struct ColumnOrder {
/** The order used for this column */
0: required Order order;
Contributor:

In Thrift we usually start field numbering at 1, not 0. I don't remember why, but I see that we're doing that in the other structs, so I think we should do it here too.

Contributor Author:

Fixed.

@@ -576,5 +631,8 @@ struct FileMetaData {
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
**/
6: optional string created_by

/** Sort order used for each column in this file */
7: optional list<ColumnOrder> column_orders;
Contributor:

How do you know from one ColumnOrder which column it applies to? Its position in this list has to match the position in the schema list above? If that's the case, I think we should add a comment for that here. I guess this has to be here to avoid repeating it in each Statistics object?

Contributor Author:

There is one ColumnOrder for each column, in the same order as the column chunks (which is required to match the order in the schema).

Member:

Another option is to add an optional column_order field in SchemaElement, which would be set only for primitive types.
10: optional ColumnOrder column_order;
https://github.com/rdblue/parquet-format/blob/9962df8e0ea85858cf032451d0ee83ec3f4d39fe/src/main/thrift/parquet.thrift#L260

Contributor Author:

My reasoning for not adding this to the schema is that I don't think I'd add this to the string-based schema definition. I want to keep those roughly the same by not adding things to SchemaElement that aren't in the schema strings.

Member:

good point

* identify the actual order used for min, max, and sorting values.
*
* This identifier should follow one of the following formats:
* * 'icu54:<locale-keyword>' - ICU 54 ordering for the ICU 54 locale keyword
Contributor:

what about something like:

struct ColumnOrder {
  1: required OrderType orderType
  2: optional string orderSubType
  3: required boolean is_ascending
}
enum OrderType {
  SIGNED, UNSIGNED, ICU_54, OTHER // more can be added as needed
}

and orderSubType isn't set for SIGNED/UNSIGNED ?

Contributor:

Actually, if we really want to avoid the confusion of mixing specific orders and classes of orders, we can use a union instead:

struct ColumnOrder {
  1: required Order order
  2: required boolean is_ascending
}

struct Signed {}
struct Unsigned {}
struct ICU_54 {
  1: required string locale_keyword
}
struct Other {
  1: required string name
}
union Order {
  1: Signed signed
  2: Unsigned unsigned
  3: ICU_54 icu_54
  4: Other other
  // add more as needed
}

This will generate an Order struct that can only be one of {Signed, Unsigned, ICU_54(locale_keyword), Other(name)}. It avoids having to deal with the fact that Signed/Unsigned require no additional info, and it also means we can add extra info to each type of ordering as needed.

Contributor Author:

I've removed the other symbols. We can revisit this when we know more about the requirements of processing engines.

Member:

I think we still need to finalize this aspect of the discussion.
The signed/unsigned aspect is what's blocking Impala/SparkSQL right now.
When they move to implement collation it will be easier to converge on the collation string spec.
Let's open a separate JIRA for that.

Contributor:

In that case, should we remove the commented out fields + their docs?

Member:

@rdblue It sounds like Union? +1 from me as well

Contributor Author:

I don't think we've confirmed that the union is forward-compatible. I haven't had time to check, so if either of you wants to that would be helpful.

Member:

Works for thrift 0.9.3. I don't have thrift 0.7 working on my laptop right now.
If you want to try with thrift 0.7:
apache/parquet-java#405

Member:

related: we should just update parquet-format to thrift 0.9.3
#50

Member:

@rdblue: works for thrift 0.7 per travis-ci: apache/parquet-java#405

This adds a list of ColumnOrder to FileMetaData. The list should include
one ColumnOrder for each column, in the same order as the
file's column chunks. Each ColumnOrder is a struct with an Order enum
(signed, unsigned, or custom) and an optional string to describe custom
orders.
@rdblue force-pushed the PARQUET-686-add-stats-ordering branch from f4a25da to 4534062 on February 7, 2017
@julienledem (Member) left a comment

This looks good to me.
My last question is whether column_order should be in SchemaElement (see comment above) rather than a list in FileMetadata.

LogicalTypes.md Outdated
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.

The sort order used for `DECIMAL` values must be `SIGNED`. The order is
Contributor:

"is" or "must be" seem superfluous. It seems like a logical conclusion, so I'd keep "is".

Contributor:

+1

/** Descriptor for the order used for min, max, and sorting values in a column
*/
struct ColumnOrder {
/** The order used for this column */
Contributor:

Should we add to the comment that this has to be ascending?

Contributor Author:

I don't think so. I think min and max are clear enough.

/**
* Identifiers for custom orderings, to be defined in the ColumnOrder struct.
*/
//CUSTOM = 2;
Contributor:

Timestamps are currently stored as INT96 by Impala and Hive, and with INT96 deprecation on the horizon, can we enable custom ordering for binary/fixed types in this change?

Member:

what would CUSTOM mean in that context?

Contributor:

Impala and Hive store timestamps as a tuple (int64_t, uint32_t), where the first element is the time of day in nanoseconds and the second is the day according to the Julian calendar. Ordering these means sorting by the second element first (the day), then by the first (the time). None of the other orderings seem to reflect this, so it would be good if we could at least set it to CUSTOM to signal that any validation will fail unless the client knows how to interpret them.

Does it make more sense to document something like IMPALA_TIMESTAMP as its own logical type?

Member:

We should definitely change this situation: currently the values stored as INT96 aren't really interpretable as integers. Better to open a separate JIRA to untangle that a bit from the sorting discussion here.

@lekv I also tried to document the current state in #49; it would be nice if you could have a short look at whether the formula is correct.

Contributor:

@xhochy - I left some comments in #49. I agree that it's better to untangle / document Impala timestamps in a separate JIRA. Should we use PARQUET-861 for that, or shall I go ahead and make a new one?

@julienledem - Do you think it's better to defer adding "CUSTOM" orderings until adding the first LogicalType that needs one?

Contributor:

I don't understand the claim that INT96 values aren't interpretable as integers. They don't map to a common hardware integer type, but they're still logically two's-complement integers.

I think on the Impala side we may end up implementing min-max stats for INT96 ahead of doing a larger rework of the timestamp data type, so it would be good to have clarity about what other implementations should do when they encounter INT96 min-max stats.

My take is that they should either ignore INT96 stats (if they don't want to implement a deprecated type) or use signed ordering. Parquet-mr doesn't do this currently. Does everyone agree that https://issues.apache.org/jira/browse/PARQUET-840 is a bug in parquet-mr?

Contributor:

I guess I'm assuming that signed INT96 ordering is equivalent to the logical IMPALA_TIMESTAMP ordering - I think this is true, because the least significant 8 bytes of the INT96 are the unsigned nanoseconds component and the most significant 4 bytes of the INT96 are the signed Julian day.

Member:

@timarmstrong if that's the case then we can just use SIGNED Ordering for them.
@lekv do you agree?

Contributor:

@timarmstrong I assumed we are writing (nanoseconds, day) tuples, and the UNSIGNED order would not be equivalent to the logical ordering. Internally Impala stores them as (time, date) and we don't reverse those. Hive seems to do it the same way as Impala: https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/example/data/simple/NanoTime.java#L58

Contributor:

I think it works out because INT96 is little-endian: the most significant 4 bytes of the INT96 line up with the 4 bytes of the date.

@julienledem (Member) left a comment

Looks good to me.
I think we moved the remaining discussion points to separate JIRAs, and this addresses the first use case: having the default UTF-8 ordering for strings and fixing the ordering of other unsigned logical types.
Follow-up in separate JIRAs:

  • int96
  • collations for string type

LogicalTypes.md Outdated
@@ -55,6 +57,8 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.

The sort order used for signed integer types must be `SIGNED`.
Contributor:

So this means that writers are responsible for setting order to SIGNED when writing a column of this type, right? What is the expected behaviour if a reader encounters a logical type of signed integer but UNSIGNED order? Is this invalid according to the spec, or does the sort order setting override the default sort for the logical type?

Member:

My interpretation is that we should fail.
This is redundant with the logical type, but it is more explicit and deals with backward compatibility.

Contributor:

Sounds reasonable to me.

Contributor Author:

Any implementation should ignore stats that it cannot use. I think Lars mentioned in the recent hangout that there are transforms that can be done in some cases to use stats that are written with some compatible orderings.

A simple example is where a UINT32 field has min=10 and max=1000. Since both are positive and not in the top half of the UINT32 range, we know that the same min/max are produced by both signed and unsigned comparison and the implementation can use it.

LogicalTypes.md Outdated
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `UTF8` strings must be `UNSIGNED` byte-wise comparison.
Contributor:

It might be helpful for implementers to note that this is equivalent to ordering by unicode code points, since that's a bit subtle. If you feel like that's overly verbose feel free to leave it out.

Member:

+1

Contributor:

Is this true? I thought the sort order for strings should be unsigned, from now on, but it's valid and possible for old files to have signed ordering. Should we clarify that both orderings are supported but one is preferred?

Contributor Author:

This is stating that all UTF8 from now on should be written with UNSIGNED comparison. It is still the case that files without the SortOrder list used SIGNED.

LogicalTypes.md Outdated
The sort order used for `DECIMAL` values must be `SIGNED`. The order is
must be equivalent to signed comparison of decimal values.

If the column uses `int32` or `int64` physical types, then signed comparison of
Contributor:

Thanks for this clarification.

LogicalTypes.md Outdated

If the column uses `int32` or `int64` physical types, then signed comparison of
the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
Contributor:

fixed_len_byte_array?

Contributor Author:

Yes. Fixed is used interchangeably with the full name elsewhere as well.

Contributor:

Gotcha, I'm not totally familiar with some of these conventions.

the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
most-significant bit in the first byte and then using unsigned byte-wise
comparison.
Contributor:

What about decimals encoded into variable-length binary?

Contributor Author:

There is an optimization, but it is too long to describe here. The main point is that the order must match the signed order produced by comparing the decimal values. This paragraph only gives optimizations for easy types.

Contributor:

Makes sense, thanks for explaining.

@@ -144,6 +169,9 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.

## Embedded Types
Contributor:

It would be good to clarify what the expected sort order is for embedded types (or if we want to punt on that question, make it explicit that readers should ignore it).

Contributor Author:

In the latest commit, I clarified that there isn't a required ordering for embedded types.

@@ -576,5 +628,8 @@ struct FileMetaData {
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
**/
6: optional string created_by

/** Sort order used for each column in this file */
7: optional list<ColumnOrder> column_orders;
Contributor:

What's the plan for backwards/forwards compatibility?

It seems like a newer implementation can assume that stats were written in the old way if column_orders is unset (this might be useful still when reading old data with int or floating point stats) or in the new way if column_orders is set. But is there a plan to guarantee that old implementations will ignore the new stats?

Member:

Yes, if column_orders is missing that means they are written the old way.
In the current PR it looks like old versions of the library won't tell the difference between old and new stats. If we want to make sure they don't see the new stats, we need to put them in a different field in the Stats struct.
Do we want to do that?

Contributor:

+1 to making them a different field.

Contributor Author:

Sorry, I think this was clear from previous discussion, but not from the PR.

Readers must assume that the order is signed unless this sort order is set, because that is what the Java library currently writes. The original min and max stats fields are still used because the majority of the values are valid. There are only a few types that are currently stored incorrectly:

  • UTF8 strings
  • Decimals (stored as fixed or bytes)
  • Intervals
  • UINTs

Unsigned ints and intervals aren't used in any engine that I know about. Decimals are used, but I doubt anyone is filtering based on the stats because there isn't a way to pass a decimal to the filter code path in Parquet Java. (You'd have to pass an encoded decimal Binary; same problem for intervals.)

That means that the only type with incorrect stats that is really a problem is UTF8. The reason we didn't catch this problem sooner is that characters stored as a single byte without the msb set (including ASCII) have the same sort order under signed and unsigned comparison. This is why Parquet Java has a property to ignore the wrong sort order and use stats anyway.

Using a separate field for the new stats would mean that old readers can't use stats for UTF8 fields, even though there are lots of cases where it is fine, and it is not really worse than what they do already -- behavior no one noticed was wrong for a year. It would also require more format changes and could prevent older readers from using stats for signed fields that are correct.

@julienledem (Member):

thanks for the review @timarmstrong.

@julienledem (Member) left a comment

LGTM
any last comments?

@timarmstrong (Contributor) commented Feb 15, 2017 via email

@rdblue (Contributor Author) commented Feb 15, 2017

@timarmstrong, even after this PR, there's no way to record that you're using a special unicode collation, so it isn't allowed. If we included the custom order identifier, then readers that know to only use stats when they support the order are fine because they can ignore them. Older readers would not produce correct results.

This raises a good point about format compatibility. I think it is okay to continue using the current min and max fields because the result is no more broken than it is already. But other collating sequences are forward-incompatible changes to the format if they use the same fields. Also, the proposals to add symbols to the Order enum are forward-incompatible changes: a reader that expects only SIGNED and UNSIGNED can't deserialize footers with new symbols.

That's enough to convince me that you guys were right all along and we should use new min and max fields. We can also store min and max in the old fields for older readers when compatible, but new stats that may have a different ordering shouldn't be available to older readers.

I'll add a commit that does the following:

  • Adds new min and max fields to Statistics and notes that the old fields are optional and can be set when the sort order of the column is SIGNED.
  • Adds back the CUSTOM symbol to the Order enum and the custom_order string in ColumnOrder. Since we can't add symbols in the future, I think this is a good way to allow extension without forward-incompatible changes.

@isnotinvain and @julienledem, you'll probably want to take another look after the new commit.
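A hedged sketch of where that plan leads for Statistics (field ids follow the existing struct; comments are paraphrased, not the final committed text):

struct Statistics {
  /** DEPRECATED: min and max determined by SIGNED comparison only.
   *  May still be written, for older readers, when the column order is SIGNED. */
  1: optional binary max;
  2: optional binary min;
  3: optional i64 null_count;
  4: optional i64 distinct_count;
  /** Min and max values determined by the column's ColumnOrder. */
  5: optional binary max_value;
  6: optional binary min_value;
}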

This avoids forward-compatibility problems.
@julienledem (Member):

Adding new fields is fine by me, although we could rename min and max to min_signed and max_signed for clarity; Thrift allows renaming fields.
I put a comment above regarding the enum.

@rdblue (Contributor Author) commented Mar 24, 2017

@julienledem, I updated this to use a union. One last review?

@isnotinvain (Contributor):

Union changes LGTM, thanks! +1

@rdblue (Contributor Author) commented Mar 24, 2017

Thanks for the quick review, @isnotinvain! I'll merge early next week after everyone has had a chance to weigh in on the changes.

@lekv (Contributor) commented Mar 31, 2017

Are we ready to merge this one?

@rdblue (Contributor Author) commented Mar 31, 2017

I think it's ready, but I thought @julienledem may want to have another look.

@lekv (Contributor) commented Mar 31, 2017

I just tried out the parquet.thrift file by copying it into the Impala sources and trying to compile, and gcc choked on ConvertedType.NULL.

In file included from be/src/util/debug-util.h:34:0,
                 from be/src/catalog/catalog-util.cc:24:
be/generated-sources/gen-cpp/parquet_types.h:58:5: error: expected identifier before ‘__null’
     NULL = 25
     ^
be/generated-sources/gen-cpp/parquet_types.h:58:5: error: expected ‘}’ before ‘__null’
be/generated-sources/gen-cpp/parquet_types.h:58:5: error: expected unqualified-id before ‘__null’
In file included from be/src/util/debug-util.h:34:0,
                 from be/src/catalog/catalog-util.cc:24:
be/generated-sources/gen-cpp/parquet_types.h:236:3: error: ‘Type’ does not name a type
   Type::type type;
   ^
be/generated-sources/gen-cpp/parquet_types.h:241:3: error: ‘ConvertedType’ does not name a type
   ConvertedType::type converted_type;
   ^

Has anyone else tried to compile this with g++? We're currently using gcc-4.9.2 and thrift-0.9.0.

I'll investigate further but I also thought someone may have seen this before.

@rdblue (Contributor Author) commented Mar 31, 2017

@lekv, ConvertedType.NULL is new, but was removed in #51 to avoid this problem. Reviews on that one are welcome.

@lekv (Contributor) commented Apr 3, 2017

@rdblue - I missed that it wasn't actually introduced in this commit. Apologies for the noise.

I also noticed that the current layout of the comments in Statistics seems to confuse the Thrift compiler when it turns them into a Python class docstring.

class Statistics:
  """
  Statistics per row group and per page
  All fields are optional.

  Attributes:
   - max: DEPRECATED: min and max value of the column. Use min_value and max_value.

  Values are encoded using PLAIN encoding, except that variable-length byte
  arrays do not include a length prefix.

  These fields encode min and max values determined by SIGNED comparison
  only. New files should use the correct order for a column's logical type
  and store the values in the min_value and max_value fields.

  To support older readers, these may be set when the column order is
  SIGNED.
   - min
   - null_count: count of null value in the column
   - distinct_count: count of distinct values occurring
   - max_value: Min and max values for the column, determined by its ColumnOrder.

  Values are encoded using PLAIN encoding, except that variable-length byte
  arrays do not include a length prefix.
   - min_value
  """

Do we usually try to format the Thrift code in a way that the comments look nice in Python?

@julienledem (Member) left a comment

+1 LGTM
Thanks!

* be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
* as fixed or bytes should be ignored.
*/
7: optional list<ColumnOrder> column_orders;
Member:

Do you want to add a note that there is exactly one ColumnOrder per column as defined by the schema (1 column per leaf node in the schema)?

union ColumnOrder {
1: Signed SIGNED;
2: Unsigned UNSIGNED;
}
@julienledem (Member) commented Apr 6, 2017

Sorry for coming back to this (I blame the jet lag for answering too fast in my previous comment).
Actually, I'd suggest the following (to get a more union-like definition rather than an enum-as-union):

/** 
 * sorted according to the physical type (considered signed or not)
 * binary is in lexicographic order (assuming each byte is signed or unsigned)
 * numerical types are sorted according to their natural order (signed or unsigned)
 */
struct PhysicalTypeColumnOrder { 
  1: required boolean signed;
}
/** Union containing the order used for min, max, and sorting values in a column */
 union ColumnOrder {
   1: PhysicalTypeColumnOrder physical;
 }

and naturally in the future we would add the following as a second type in the union:

/** the data is sorted according to the provided collation */
struct CollatedColumnOrder {
  1: required string collation_string;
}

* Sort order used for each column in this file.
*
* If this list is not present, then the order for each column is assumed to
* be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
Contributor:

I am not sure I understand the second sentence. Why would we need to ignore decimal min/max values if the order is not present? Why does it say "should"?

Contributor Author:

If the order isn't present, then the stats were produced by byte-wise signed comparison, which is useless for those types. Readers should ignore the values.

@rdblue force-pushed the PARQUET-686-add-stats-ordering branch from b812eca to f878c34 on April 14, 2017
@rdblue (Contributor Author) commented Apr 14, 2017

@julienledem, I updated this. One more review?

@julienledem (Member) left a comment

LGTM

@rdblue (Contributor Author) commented Apr 14, 2017

I'll give others a day or two to weigh in and then commit this.

@asfgit closed this in 041708d on Apr 17, 2017
@rdblue (Contributor Author) commented Apr 17, 2017

Merged. Thanks for all of the discussion and reviews, everyone!

lekv pushed a commit to lekv/parquet-format that referenced this pull request May 5, 2017
PR apache#46 introduced ColumnOrder with the limitation that a reader should
ignore stats for a column if the corresponding ColumnOrder in
FileMetaData contains an unknown value.

This change adds a special column order 'InvalidOrder' that can be used
in tests and should not be used otherwise.

This change also fixes the paths in Makefile.