
PARQUET-686: Add Order to store the order used for min/max stats. #46

Closed
wants to merge 8 commits

Conversation

@rdblue (Contributor) commented Nov 13, 2016

This adds a new enum, Order, that will be set to the order used to produce the min and max values in all Statistics objects (at the page level). Order has 8 symbols: SIGNED, UNSIGNED, and 6 symbols for custom orderings. This also adds a CustomOrder struct that is used to map the custom order symbols to string descriptors, such as order keywords used by ICU collating sequences. CustomOrder mappings are stored in the file footer.
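A rough Thrift sketch of what this description proposes (names follow the text above; field ids are illustrative, not the exact diff):

enum Order {
  SIGNED = 0;    // signed comparison of the physical values
  UNSIGNED = 1;  // unsigned, byte-wise comparison
  CUSTOM1 = 2;
  CUSTOM2 = 3;
  CUSTOM3 = 4;
  CUSTOM4 = 5;
  CUSTOM5 = 6;
  CUSTOM6 = 7;
}

/** Maps a CUSTOMn symbol to a string descriptor stored once in the footer,
 *  e.g. an ICU collation keyword. */
struct CustomOrder {
  1: required Order order;       // one of the CUSTOMn symbols
  2: required string descriptor; // well-known identifier for the collating sequence
}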

@rdblue (Contributor Author) commented Nov 14, 2016

@isnotinvain and @julienledem, could you have a look?

CUSTOM3 = 4;
CUSTOM4 = 5;
CUSTOM5 = 6;
CUSTOM6 = 7;
Member:

I'm not a big fan of this mechanism for extending ordering.
How about we define Collation as a string defined by the Unicode standard, as in your link below:
See http://userguide.icu-project.org/collation/api#TOC-Instantiating-the-Predefined-Collators
And we define "BINARY" as the unsigned order and "SIGNED_BINARY" as the signed one.
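A minimal sketch of that suggestion (hypothetical, for illustration only):

/** One collation identifier per column: "BINARY" for the unsigned byte-wise order,
 *  "SIGNED_BINARY" for the signed order, anything else an ICU collation keyword. */
struct Collation {
  1: required string name;
}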

Member:

Or possibly a more union-style struct:

enum CollationStandard {
  BINARY, // only "SIGNED" and "UNSIGNED" are accepted
  UNICODE; // must provide locale
}
struct Collation {
  1: required CollationStandard std;
  2: required string name;
}

Comment:

I like this union style struct approach as well.

Contributor Author:

The problem this enum is trying to solve is to avoid writing collation identifier strings in the metadata of each page. The approach I took was to keep that field small with an enum and translate the custom symbols to collating sequence identifier strings using metadata in the footer.

If we used the union-style struct approach, then "signed", "unsigned", or the identifier string would be present in every page. At a minimum, I think signed and unsigned should be symbols and only custom strings would be in every page.

Member:

I don't think the sort order should be allowed to change per page. So it should not be in the page metadata but in the footer per column. For example in ColumnMetaData or in SchemaElement.
I would tend to pick SchemaElement.
As a separate note, if it is an id to look up in a table, then we should make it an integer instead of an enum.

Comment:

What if we just require the collation identifier string to be present when users are not using signed binary / unsigned binary?

@piyushnarang left a comment

Approach for the most part looks good to me. Added some minor comments.

* If this field is missing, the order must be the signed ordering for
* backward-compatibility.
*/
5: optional Order order;

Comment:

wondering if there would be a scenario where users want to store multiple ordering stats? (e.g. to support multiple kinds of queries?)

Contributor Author:

I don't think we will need them. Can you think of a specific use case for it?

Comment:

Yeah I don't have a concrete use-case for it right now but I think it might be possible that users want to support multiple kinds of queries over their data (one with locale = x, one with locale = y)? Don't think we need to handle this right now though.


/**
* A string that identifies the order denoted by the place-holder order enum
* symbol. Ordering descriptors should be well-known identifiers, such as the

Comment:

Should we be explicit about exactly what the identifiers should be? The comment as it stands now seems open to interpretation. If we explicitly say the identifiers should be only the locale keywords in ICU-54, that allows various language implementations to code against that.

Contributor Author:

At this point, we don't have an implementation that supports custom collating sequences, so I'm hesitant to be very prescriptive about identifying them. The implementation I envision is allowing callers to pass a Comparator and an identifier String, without the Parquet library controlling sorting. If that's something we can agree on, then this shouldn't be an area where we need to impose requirements.

CUSTOM3 = 4;
CUSTOM4 = 5;
CUSTOM5 = 6;
CUSTOM6 = 7;

Comment:

So if I understand right, we can't have more than 6 custom orders for a given Parquet file, right?

Contributor Author:

Yes.

@rdblue (Contributor Author) commented Dec 7, 2016

Sorry for the delay, I didn't get a notification that there were comments. I'll look through and respond. Thanks for reviewing, everyone!

@rdblue (Contributor Author) commented Jan 6, 2017

@julienledem, @piyushnarang, can you both take another look at this? I'd like to get this fixed so we can release it.

@piyushnarang left a comment

Changes for the most part look good to me. Also think an enum for CollationStandard would be nice.


@@ -197,18 +197,78 @@ enum FieldRepetitionType {
REPEATED = 2;
}

/** Identifier for the sort order used to produce min and max values. */
Contributor:

Wouldn't this also apply to the sorting_columns in the row groups?

Contributor Author:

I think sorting columns may need a separate update. These identifiers could potentially be used there, but I don't think the ordering (or direction) is currently stored.

@rdblue (Contributor Author) commented Feb 6, 2017

I updated this in response to @julienledem's comments. I think he's right that we only need to store the order once for each column, so this now adds list<ColumnOrder> to the footer. The Statistics objects are unchanged because there's no need to add the order to each stats object in the format.

In the new commit, ColumnOrder is a struct with 2 fields: the first is order, an Order enum with SIGNED, UNSIGNED, and CUSTOM; the second is custom_order, a string that describes the order used when order is CUSTOM.
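Roughly, the described struct looks like this (a sketch; field ids are illustrative and the actual diff is quoted below):

enum Order {
  SIGNED = 0;
  UNSIGNED = 1;
  CUSTOM = 2;
}

/** One entry per column, stored as list<ColumnOrder> in FileMetaData. */
struct ColumnOrder {
  1: required Order order;
  2: optional string custom_order; // set only when order is CUSTOM
}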

@rdblue force-pushed the PARQUET-686-add-stats-ordering branch from c3ef2e7 to f4a25da on February 6, 2017
@rdblue (Contributor Author) commented Feb 6, 2017

I've updated the last commit to include the following:

  • Added ColumnOrder.is_ascending boolean
  • Added a specific format for the custom_order string

I had initially added ICU54 to the Order enum, but it seems confusing to have Order include both specific orderings and classes of orderings, so I went with the string prefix option. We can change it if anyone is strongly opposed.

0: required Order order;

/** Whether the order used is ascending (true) or descending (false) */
1: required boolean is_ascending;
Member:

SortingColumn already has this, which defines the direction of the sort in the file:
4: optional list<SortingColumn> sorting_columns
In the case of statistics I don't think we should add an 'is_ascending' field in ColumnOrder as this basically just swaps min and max.
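For reference, SortingColumn in parquet.thrift already looks roughly like this (per-column sort direction at the row-group level):

struct SortingColumn {
  /** The column index (in this row group) */
  1: required i32 column_idx;

  /** If true, this column is sorted in descending order. */
  2: required bool descending;

  /** If true, nulls come before non-null values; otherwise nulls go at the end. */
  3: required bool nulls_first;
}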

Contributor Author:

Works for me.

*/
struct ColumnOrder {
/** The order used for this column */
0: required Order order;
Contributor:

In Thrift we usually start field numbering at 1, not 0. I don't remember why, but I see that we're doing that in the other structs, so I think we should do it here too.

Contributor Author:

Fixed.

@@ -576,5 +631,8 @@ struct FileMetaData {
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
**/
6: optional string created_by

/** Sort order used for each column in this file */
7: optional list<ColumnOrder> column_orders;
Contributor:

How do you know from one ColumnOrder which column it applies to? Its position in this list has to match the position in the schema list above? If that's the case, I think we should add a comment for that here. I guess this has to be here to avoid repeating it in each Statistics object?

Contributor Author:

There is one ColumnOrder for each column, in the same order as the column chunks (which is required to match the order in the schema).

Member:

Another option is to add an optional column_order field in SchemaElement, which would be set only for primitive types.
10: optional ColumnOrder column_order;
https://github.com/rdblue/parquet-format/blob/9962df8e0ea85858cf032451d0ee83ec3f4d39fe/src/main/thrift/parquet.thrift#L260

Contributor Author:

My reasoning for not adding this to the schema is that I don't think I'd add this to the string-based schema definition. I want to keep those roughly the same by not adding things to SchemaElement that aren't in the schema strings.

Member:

good point

* identify the actual order used for min, max, and sorting values.
*
* This identifier should follow one of the following formats:
* * 'icu54:<locale-keyword>' - ICU 54 ordering for the ICU 54 locale keyword
Contributor:

what about something like:

struct ColumnOrder {
  1: required OrderType orderType
  2: optional string orderSubType
  3: required boolean is_ascending
}
enum OrderType {
  SIGNED, UNSIGNED, ICU_54, OTHER // more can be added as needed
}

and orderSubType isn't set for SIGNED/UNSIGNED ?

Contributor:

Actually, if we really want to avoid the confusion of mixing specific orders and classes of orders, we can use a union instead:

struct ColumnOrder {
  1: required Order order
  2: required boolean is_ascending
}

struct Signed {}
struct Unsigned {}
struct ICU_54 {
  1: required string locale_keyword
}
struct Other {
  1: required string name
}
union Order {
  1: Signed signed
  2: Unsigned unsigned
  3: ICU_54 icu_54
  4: Other other
  // add more as needed
}

This will generate an Order struct that can only be one of {Signed, Unsigned, ICU_54(locale_keyword), Other(name)}. It avoids having to deal with the fact that Signed/Unsigned require no additional info, and it also means we can add extra info to each type of ordering as needed.

Contributor Author:

I've removed the other symbols. We can revisit this when we know more about the requirements of processing engines.

Member:

I think we still need to finalize this aspect of the discussion.
The signed/unsigned aspect is what's blocking Impala/SparkSQL right now.
When they move to implement collation it will be easier to converge on the collation string spec.
Let's open a separate JIRA for that.

Contributor:

In that case, should we remove the commented out fields + their docs?

Member:

@rdblue It sounds like Union? +1 from me as well

Contributor Author:

I don't think we've confirmed that the union is forward-compatible. I haven't had time to check, so if either of you wants to that would be helpful.

Member:

Works for thrift 0.9.3. I don't have thrift 0.7 working on my laptop right now.
If you want to try with thrift 0.7:
apache/parquet-java#405

Member:

related: we should just update parquet-format to thrift 0.9.3
#50

Member:

@rdblue: works for thrift 0.7 per travis-ci: apache/parquet-java#405

This adds a list of ColumnOrder to FileMetaData. The list should include
one ColumnOrder for each column, in the same order as the
file's column chunks. Each ColumnOrder is a struct with an Order enum
(signed, unsigned, or custom) and an optional string to describe custom
orders.
@rdblue force-pushed the PARQUET-686-add-stats-ordering branch from f4a25da to 4534062 on February 7, 2017
@julienledem (Member) left a comment

This looks good to me.
My last question is whether column_order should be in SchemaElement (see comment above) rather than a list in FileMetadata.

LogicalTypes.md Outdated
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.

The sort order used for `DECIMAL` values must be `SIGNED`. The order is
Contributor:

"is" or "must be" seem superfluous. It seems like a logical conclusion, so I'd keep "is".

Contributor:

+1

/** Descriptor for the order used for min, max, and sorting values in a column
*/
struct ColumnOrder {
/** The order used for this column */
Contributor:

Should we add to the comment that this has to be ascending?

Contributor Author:

I don't think so. I think min and max are clear enough.

/**
* Identifiers for custom orderings, to be defined in the ColumnOrder struct.
*/
//CUSTOM = 2;
Contributor:

Timestamps are currently stored as INT96 by Impala and Hive, and with INT96 deprecation on the horizon, can we enable custom ordering for binary/fixed types in this change?

Member:

what would CUSTOM mean in that context?

Contributor:

Impala and Hive store timestamps as a tuple (int64_t, uint32_t), where the first element is the time of day in nanoseconds and the second is the day according to the Julian calendar. Ordering these means sorting by the second element first (the day), then by the first (the time). None of the other orderings seem to reflect this, so it would be good if we could at least set it to CUSTOM to signal that any validation will fail unless the client knows how to interpret them.

Does it make more sense to document something like IMPALA_TIMESTAMP as its own logical type?

Member:

We should definitely change this situation: currently the values stored as INT96 aren't really interpretable as integers. Better to open a separate JIRA to untangle that a bit from the sorting discussion here.

@lekv I also tried to document the current state in #49; it would be nice if you could have a short look at whether the formula is correct.

Contributor:

@xhochy - I left some comments in #49. I agree that it's better to untangle / document Impala timestamps in a separate JIRA. Should we use PARQUET-861 for that, or shall I go ahead and make a new one?

@julienledem - Do you think it's better to defer adding "CUSTOM" orderings until adding the first LogicalType that needs one?

Contributor:

I don't understand the claim that INT96 values aren't interpretable as integers. They don't map to a common hardware integer type, but they're still logically two's-complement integers.

I think on the Impala side we may end up implementing min-max stats for INT96 ahead of doing a larger rework of the timestamp data type, so it would be good to have clarity about what other implementations should do when they encounter INT96 min-max stats.

My take is that they should either ignore INT96 stats (if they don't want to implement a deprecated type) or use signed ordering. Parquet-mr doesn't do this currently. Does everyone agree that https://issues.apache.org/jira/browse/PARQUET-840 is a bug in parquet-mr?

Contributor:

I guess I'm assuming that signed INT96 ordering is equivalent to the logical IMPALA_TIMESTAMP ordering - I think this is true, because the least significant 8 bytes of the INT96 are the unsigned nanoseconds component and the most significant 4 bytes of the INT96 are the signed Julian day.

Member:

@timarmstrong if that's the case then we can just use SIGNED Ordering for them.
@lekv do you agree?

Contributor:

@timarmstrong I assumed we are writing (nanoseconds, day) tuples, and the UNSIGNED order would not be equivalent to the logical ordering. Internally Impala stores them as (time, date) and we don't reverse those. Hive seems to do it the same way as Impala: https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/example/data/simple/NanoTime.java#L58

Contributor:

I think it works out because INT96 is little-endian: the most significant 4 bytes of the INT96 line up with the 4 bytes of the date.

@julienledem (Member) left a comment

Looks good to me.
I think we moved the remaining discussion points to separate JIRAs, and this addresses the first use case: having the default UTF-8 ordering for strings and fixing the ordering of other unsigned logical types.
Follow-up in separate JIRAs:

  • int96
  • collations for string type

LogicalTypes.md Outdated
@@ -55,6 +57,8 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.

The sort order used for signed integer types must be `SIGNED`.
Contributor:

So this means that writers are responsible for setting order to SIGNED when writing a column of this type, right? What is the expected behaviour if a reader encounters a logical type of signed integer but UNSIGNED order? Is this invalid according to the spec, or does the sort order setting override the default sort for the logical type?

Member:

My interpretation is that we should fail.
This is redundant with the logical type, but it is more explicit and deals with backward compatibility.

Contributor:

Sounds reasonable to me.

Contributor Author:

Any implementation should ignore stats that it cannot use. I think Lars mentioned in the recent hangout that there are transforms that can be done in some cases to use stats that are written with some compatible orderings.

A simple example is where a UINT32 field has min=10 and max=1000. Since both are positive and not in the top half of the UINT32 range, we know that the same min/max are produced by both signed and unsigned comparison and the implementation can use it.

LogicalTypes.md Outdated
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `UTF8` strings must be `UNSIGNED` byte-wise comparison.
Contributor:

It might be helpful for implementers to note that this is equivalent to ordering by unicode code points, since that's a bit subtle. If you feel like that's overly verbose feel free to leave it out.

Member:

+1

Contributor:

Is this true? I thought the sort order for strings should be unsigned, from now on, but it's valid and possible for old files to have signed ordering. Should we clarify that both orderings are supported but one is preferred?

Contributor Author:

This is stating that all UTF8 from now on should be written with UNSIGNED comparison. It is still the case that files without the SortOrder list used SIGNED.

LogicalTypes.md Outdated
The sort order used for `DECIMAL` values must be `SIGNED`. The order is
must be equivalent to signed comparison of decimal values.

If the column uses `int32` or `int64` physical types, then signed comparison of
Contributor:

Thanks for this clarification.

LogicalTypes.md Outdated

If the column uses `int32` or `int64` physical types, then signed comparison of
the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
Contributor:

fixed_len_byte_array?

Contributor Author:

Yes. Fixed is used interchangeably with the full name elsewhere as well.

Contributor:

Gotcha, I'm not totally familiar with some of these conventions.

the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
most-significant bit in the first byte and then using unsigned byte-wise
comparison.
Contributor:

What about decimals encoded into variable-length binary?

Contributor Author:

There is an optimization, but it is too long to describe here. The main point is that the order must match the signed order produced by comparing the decimal values. This paragraph only gives optimizations for easy types.

Contributor:

Makes sense, thanks for explaining.

@@ -144,6 +169,9 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.

## Embedded Types
Contributor:

It would be good to clarify what the expected sort order is for embedded types (or if we want to punt on that question, make it explicit that readers should ignore it).

Contributor Author:

In the latest commit, I clarified that there isn't a required ordering for embedded types.

@@ -576,5 +628,8 @@ struct FileMetaData {
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
**/
6: optional string created_by

/** Sort order used for each column in this file */
7: optional list<ColumnOrder> column_orders;
Contributor:

What's the plan for backwards/forwards compatibility?

It seems like a newer implementation can assume that stats were written in the old way if column_orders is unset (this might be useful still when reading old data with int or floating point stats) or in the new way if column_orders is set. But is there a plan to guarantee that old implementations will ignore the new stats?

Member:

Yes, if column_orders is missing that means they are written the old way.
In the current PR it looks like old versions of the library won't tell the difference between old and new stats. If we want to make sure they don't see the new stats, we need to put them in a different field in the Stats struct.
Do we want to do that?

Contributor:

+1 to making them a different field.

Contributor Author:

Sorry, I think this was clear from previous discussion, but not from the PR.

Readers must assume that the order is signed unless this sort order is set, because that is what the Java library currently writes. The original min and max stats fields are still used because the majority of the values are valid. There are only a few types that are currently stored incorrectly:

  • UTF8 strings
  • Decimals (stored as fixed or bytes)
  • Intervals
  • UINTs

Unsigned ints and intervals aren't used in any engine that I know about. Decimals are used, but I doubt anyone is filtering based on the stats because there isn't a way to pass a decimal to the filter code path in Parquet Java. (You'd have to pass an encoded decimal Binary; same problem for intervals.)

That means that the only type with incorrect stats that is really a problem is UTF8. The reason we didn't catch this problem sooner is that characters stored as a single byte without the msb set (including ASCII) have the same sort order under signed and unsigned comparison. This is why Parquet Java has a property to ignore the wrong sort order and use stats anyway.

Using a separate field for the new stats would mean that old readers can't use stats for UTF8 fields, even though there are lots of cases where it is fine, and it is not really worse than what they do already -- behavior no one noticed was wrong for a year. It would also require more format changes and could prevent older readers from using stats for signed fields that are correct.

@julienledem (Member):

thanks for the review @timarmstrong.

@julienledem (Member) left a comment

LGTM
any last comments?

@timarmstrong (Contributor) commented Feb 15, 2017 via email

@rdblue (Contributor Author) commented Feb 15, 2017

@timarmstrong, even after this PR, there's no way to record that you're using a special unicode collation, so it isn't allowed. If we included the custom order identifier, then readers that know to only use stats when they support the order are fine because they can ignore them. Older readers would not produce correct results.

This raises a good point about format compatibility. I think it is okay to continue using the current min and max fields because the result is no more broken than it is already. But other collating sequences are forward-incompatible changes to the format if they use the same fields. Also, the proposals to add symbols to the Order enum are forward-incompatible changes: a reader that expects only SIGNED and UNSIGNED can't deserialize footers with new symbols.

That's enough to convince me that you guys were right all along and we should use new min and max fields. We can also store min and max in the old fields for older readers when compatible, but new stats that may have a different ordering shouldn't be available to older readers.

I'll add a commit that does the following:

  • Adds new min and max fields to Statistics and notes that the old fields are optional and can be set when the sort order of the column is SIGNED.
  • Adds back the CUSTOM symbol to the Order enum and the custom_order string in ColumnOrder. Since we can't add symbols in the future, I think this is a good way to allow extension without forward-incompatible changes.

@isnotinvain and @julienledem, you'll probably want to take another look after the new commit.
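A hedged sketch of where that plan leads for Statistics (field ids follow the existing struct; comments are paraphrased, not the final committed text):

struct Statistics {
  /** DEPRECATED: min and max determined by SIGNED comparison only.
   *  May still be written, for older readers, when the column order is SIGNED. */
  1: optional binary max;
  2: optional binary min;
  3: optional i64 null_count;
  4: optional i64 distinct_count;
  /** Min and max values determined by the column's ColumnOrder. */
  5: optional binary max_value;
  6: optional binary min_value;
}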

This avoids forward-compatibility problems.
@julienledem (Member):

Adding new fields is fine by me, although we could rename min and max to min_signed and max_signed for clarity; Thrift allows renaming fields.
I put a comment above regarding the enum.

@rdblue (Contributor Author) commented Mar 24, 2017

@julienledem, I updated this to use a union. One last review?

@isnotinvain (Contributor):

Union changes LGTM, thanks! +1

@rdblue (Contributor Author) commented Mar 24, 2017

Thanks for the quick review, @isnotinvain! I'll merge early next week after everyone has had a chance to weigh in on the changes.

@lekv (Contributor) commented Mar 31, 2017

Are we ready to merge this one?

@rdblue (Contributor Author) commented Mar 31, 2017

I think it's ready, but I thought @julienledem may want to have another look.

@lekv (Contributor) commented Mar 31, 2017

I just tried out the parquet.thrift file by copying it into the Impala sources and trying to compile, and gcc choked on ConvertedType.NULL.

In file included from be/src/util/debug-util.h:34:0,
                 from be/src/catalog/catalog-util.cc:24:
be/generated-sources/gen-cpp/parquet_types.h:58:5: error: expected identifier before ‘__null’
     NULL = 25
     ^
be/generated-sources/gen-cpp/parquet_types.h:58:5: error: expected ‘}’ before ‘__null’
be/generated-sources/gen-cpp/parquet_types.h:58:5: error: expected unqualified-id before ‘__null’
In file included from be/src/util/debug-util.h:34:0,
                 from be/src/catalog/catalog-util.cc:24:
be/generated-sources/gen-cpp/parquet_types.h:236:3: error: ‘Type’ does not name a type
   Type::type type;
   ^
be/generated-sources/gen-cpp/parquet_types.h:241:3: error: ‘ConvertedType’ does not name a type
   ConvertedType::type converted_type;
   ^

Has anyone else tried to compile this with g++? We're currently using gcc-4.9.2 and thrift-0.9.0.

I'll investigate further but I also thought someone may have seen this before.

@rdblue (Contributor Author) commented Mar 31, 2017

@lekv, ConvertedType.NULL is new, but was removed in #51 to avoid this problem. Reviews on that one are welcome.

@lekv (Contributor) commented Apr 3, 2017

@rdblue - I missed that it wasn't actually introduced in this commit. Apologies for the noise.

I also noticed that the current layout of the comments in Statistics seems to confuse the Thrift compiler when it turns them into a Python class docstring.

class Statistics:
  """
  Statistics per row group and per page
  All fields are optional.

  Attributes:
   - max: DEPRECATED: min and max value of the column. Use min_value and max_value.

  Values are encoded using PLAIN encoding, except that variable-length byte
  arrays do not include a length prefix.

  These fields encode min and max values determined by SIGNED comparison
  only. New files should use the correct order for a column's logical type
  and store the values in the min_value and max_value fields.

  To support older readers, these may be set when the column order is
  SIGNED.
   - min
   - null_count: count of null value in the column
   - distinct_count: count of distinct values occurring
   - max_value: Min and max values for the column, determined by its ColumnOrder.

  Values are encoded using PLAIN encoding, except that variable-length byte
  arrays do not include a length prefix.
   - min_value
  """

Do we usually try to format the Thrift code in a way that the comments look nice in Python?

@julienledem (Member) left a comment

+1 LGTM
Thanks!

* be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
* as fixed or bytes should be ignored.
*/
7: optional list<ColumnOrder> column_orders;
Member:

Do you want to add a note that there is exactly one ColumnOrder per column as defined by the schema (1 column per leaf node in the schema)?

union ColumnOrder {
1: Signed SIGNED;
2: Unsigned UNSIGNED;
}
@julienledem (Member) commented Apr 6, 2017

Sorry for coming back to this (I blame the jet lag for answering too fast in my previous comment).
Actually, I'd suggest the following (to get a more union-like definition rather than an enum-as-union):

/** 
 * sorted according to the physical type (considered signed or not)
 * binary is in lexicographic order (assuming each byte is signed or unsigned)
 * numerical types are sorted according to their natural order (signed or unsigned)
 */
struct PhysicalTypeColumnOrder { 
  1: required boolean signed;
}
/** Union containing the order used for min, max, and sorting values in a column */
 union ColumnOrder {
   1: PhysicalTypeColumnOrder physical;
 }

and naturally in the future we would add the following as a second type in the union:

/** the data is sorted according to the provided collation */
struct CollatedColumnOrder {
  1: required string collation_string;
}

* Sort order used for each column in this file.
*
* If this list is not present, then the order for each column is assumed to
* be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
Contributor:

I am not sure I understand the second sentence. Why would we need to ignore decimal min/max values if the order is not present? Why does it say "should"?

Contributor Author:

If the order isn't present, then the stats were produced by byte-wise signed comparison, which is useless for those types. Readers should ignore the values.

@rdblue force-pushed the PARQUET-686-add-stats-ordering branch from b812eca to f878c34 on April 14, 2017
@rdblue (Contributor Author) commented Apr 14, 2017

@julienledem, I updated this. One more review?

@julienledem (Member) left a comment

LGTM

@rdblue (Contributor Author) commented Apr 14, 2017

I'll give others a day or two to weigh in and then commit this.

@asfgit closed this in 041708d on Apr 17, 2017
@rdblue (Contributor Author) commented Apr 17, 2017

Merged. Thanks for all of the discussion and reviews, everyone!

lekv pushed a commit to lekv/parquet-format that referenced this pull request May 5, 2017
PR apache#46 introduced ColumnOrder with the limitation that a reader should
ignore stats for a column if the corresponding ColumnOrder in
FileMetaData contains an unknown value.

This change adds a special column order 'InvalidOrder' that can be used
in tests and should not be used otherwise.

This change also fixes the paths in Makefile.