PARQUET-686: Add Order to store the order used for min/max stats.
This adds a new enum, `Order`, that will be set to the order used to produce the min and max values in all `Statistics` objects (at the page level). `Order` has 8 symbols: `SIGNED`, `UNSIGNED`, and 6 symbols for custom orderings. This also adds a `CustomOrder` struct that is used to map the custom order symbols to string descriptors, such as [order keywords used by ICU collating sequences](http://userguide.icu-project.org/collation/api#TOC-Instantiating-the-Predefined-Collators). `CustomOrder` mappings are stored in the file footer.

Author: Ryan Blue <blue@apache.org>

Closes apache#46 from rdblue/PARQUET-686-add-stats-ordering and squashes the following commits:

f878c34 [Ryan Blue] PARQUET-686: Remove Order enum.
9447fb8 [Ryan Blue] PARQUET-686: Use "is" instead of "must be".
ffbb60b [Ryan Blue] PARQUET-686: Store ColumnOrder as a union.
c6e43b0 [Ryan Blue] PARQUET-686: Add new min_value and max_value stats.
eed4d47 [Ryan Blue] PARQUET-686: Add clarifications from review comments.
9962df8 [Ryan Blue] PARQUET-686: Remove is_ascending and number columns starting with 1.
faa9edb [Ryan Blue] PARQUET-686: Add order specs to logical types.
4534062 [Ryan Blue] PARQUET-686: Add ColumnOrders to FileMetaData.
rdblue committed Apr 17, 2017
1 parent 65e851e commit 041708d
Showing 2 changed files with 88 additions and 1 deletion.
30 changes: 30 additions & 0 deletions LogicalTypes.md
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
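
As a rough illustration of why the `UNSIGNED` distinction matters for `UTF8`, the sketch below contrasts unsigned byte-wise comparison with a signed-byte comparison, which is what a naive comparison of Java `byte` values would give. The helper names are hypothetical and are not part of any Parquet library.

```python
def unsigned_bytewise_key(value: str) -> bytes:
    """Sort key: the UTF-8 bytes, compared as unsigned values 0..255."""
    # Python compares bytes objects lexicographically by unsigned byte value.
    return value.encode("utf-8")


def signed_bytewise_key(value: str) -> list:
    """Sort key mimicking signed-byte comparison (bytes >= 0x80 become negative)."""
    return [b - 256 if b >= 0x80 else b for b in value.encode("utf-8")]


words = ["z", "é", "a"]

# "é" encodes as 0xC3 0xA9, so it sorts after ASCII under unsigned comparison
# and before ASCII under signed comparison.
unsigned_sorted = sorted(words, key=unsigned_bytewise_key)
signed_sorted = sorted(words, key=signed_bytewise_key)

print(unsigned_sorted)
print(signed_sorted)
```

Min and max values computed with the two comparisons disagree as soon as a value contains a non-ASCII character, so a reader needs to know which one the writer used.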

## Numeric Types

### Signed Integers
@@ -55,6 +57,8 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.

The sort order used for signed integer types is `SIGNED`.

### Unsigned Integers

`UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to
@@ -70,6 +74,8 @@ allows.
`UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
`UINT_64` must annotate an `int64` primitive type.

The sort order used for unsigned integer types is `UNSIGNED`.

### DECIMAL

`DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.

The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
to signed comparison of decimal values.

If the column uses `int32` or `int64` physical types, then signed comparison of
the integer values produces the correct ordering. If the physical type is
a fixed-length byte array, then the correct ordering can be produced by flipping
the most-significant bit in the first byte and then using unsigned byte-wise
comparison.
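
The bit-flip trick can be sketched as follows. This is a minimal example, not the reference implementation: it assumes the fixed-length value holds the unscaled decimal as a big-endian two's-complement integer (as described elsewhere in the format spec, not in this diff), and that all compared values share the same width and scale. The function names are hypothetical.

```python
def decimal_fixed_sort_key(raw: bytes) -> bytes:
    """Flip the sign bit of the first byte so that unsigned byte-wise
    comparison matches signed comparison of the encoded integer."""
    if not raw:
        raise ValueError("fixed-length value must not be empty")
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def encode_unscaled(value: int, width: int) -> bytes:
    """Hypothetical helper: big-endian two's-complement bytes of the unscaled value."""
    return value.to_bytes(width, byteorder="big", signed=True)


unscaled = [5, -1, 0, -300, 299]
encoded = [encode_unscaled(v, 4) for v in unscaled]

# Sort the raw byte strings using only unsigned byte-wise comparison of the keys.
ordered = sorted(encoded, key=decimal_fixed_sort_key)
decoded = [int.from_bytes(b, byteorder="big", signed=True) for b in ordered]

print(decoded)
```

Flipping the sign bit maps negative values (first bit 1) below non-negative values (first bit 0) while preserving the relative order within each group, which is why plain unsigned comparison then suffices.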

## Date/Time Types

### DATE
@@ -106,30 +121,40 @@ A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
annotate an `int32` that stores the number of days from the Unix epoch, 1
January 1970.

The sort order used for `DATE` is `SIGNED`.

### TIME\_MILLIS

`TIME_MILLIS` is used for a logical time type with millisecond precision,
without a date. It must annotate an `int32` that stores the number of
milliseconds after midnight.

The sort order used for `TIME_MILLIS` is `SIGNED`.

### TIME\_MICROS

`TIME_MICROS` is used for a logical time type with microsecond precision,
without a date. It must annotate an `int64` that stores the number of
microseconds after midnight.

The sort order used for `TIME_MICROS` is `SIGNED`.

### TIMESTAMP\_MILLIS

`TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
millisecond precision. It must annotate an `int64` that stores the number of
milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.

The sort order used for `TIMESTAMP_MILLIS` is `SIGNED`.

### TIMESTAMP\_MICROS

`TIMESTAMP_MICROS` is used for a combined logical date and time type with
microsecond precision. It must annotate an `int64` that stores the number of
microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.

The sort order used for `TIMESTAMP_MICROS` is `SIGNED`.

### INTERVAL

`INTERVAL` is used for an interval of time. It must annotate a
@@ -144,8 +169,13 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.
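
A minimal sketch of that comparison, assuming the layout given elsewhere in the spec (not shown in this diff): a 12-byte value holding three little-endian unsigned 32-bit integers for months, days, and milliseconds. The helper names are hypothetical.

```python
import struct


def encode_interval(months: int, days: int, millis: int) -> bytes:
    """Hypothetical helper: pack three little-endian unsigned 32-bit integers."""
    return struct.pack("<III", months, days, millis)


def interval_sort_key(raw: bytes) -> tuple:
    """Decode the three components and compare them in order:
    months first, then days, then milliseconds, all unsigned."""
    if len(raw) != 12:
        raise ValueError("INTERVAL values are expected to be 12 bytes long")
    months, days, millis = struct.unpack("<III", raw)
    return (months, days, millis)


intervals = [
    encode_interval(1, 0, 0),
    encode_interval(0, 40, 0),
    encode_interval(0, 40, 1),
    encode_interval(0, 2, 999),
]

ordered = [interval_sort_key(raw) for raw in sorted(intervals, key=interval_sort_key)]
print(ordered)
```

Note that 40 days sorts before 1 month here: the order is purely component-wise and does not try to convert between units, consistent with the statement that there is no constant conversion from days to months.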

## Embedded Types

Embedded types do not have type-specific orderings.

### JSON

`JSON` is used for an embedded JSON document. It must annotate a `binary`
59 changes: 58 additions & 1 deletion src/main/thrift/parquet.thrift
@@ -28,6 +28,17 @@ namespace java org.apache.parquet.format
* with the encodings to control the on disk storage format.
* For example INT16 is not included as a type since a good encoding of INT32
* would handle this.
*
* When a logical type is not present, the type-defined sort orders of these
* physical types are:
* * BOOLEAN - false, true
* * INT32 - signed comparison
* * INT64 - signed comparison
* * INT96 - signed comparison
* * FLOAT - signed comparison
* * DOUBLE - signed comparison
* * BYTE_ARRAY - unsigned byte-wise comparison
* * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*/
enum Type {
BOOLEAN = 0;
@@ -202,13 +213,33 @@ enum FieldRepetitionType {
* All fields are optional.
*/
struct Statistics {
/** min and max value of the column, encoded in PLAIN encoding */
/**
* DEPRECATED: min and max value of the column. Use min_value and max_value.
*
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*
* These fields encode min and max values determined by SIGNED comparison
* only. New files should use the correct order for a column's logical type
* and store the values in the min_value and max_value fields.
*
* To support older readers, these may be set when the column order is
* SIGNED.
*/
1: optional binary max;
2: optional binary min;
/** count of null value in the column */
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/**
* Min and max values for the column, determined by its ColumnOrder.
*
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*/
5: optional binary max_value;
6: optional binary min_value;
}
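
To make the "PLAIN encoding, except that variable-length byte arrays do not include a length prefix" rule concrete, here is a small sketch of how a writer might serialize a min or max value. It assumes the usual PLAIN layout (little-endian fixed-width numbers, and a 4-byte little-endian length before byte arrays in data pages); the function names and the string type tags are hypothetical.

```python
import struct

# struct format codes for the fixed-width physical types (little-endian).
_NUMERIC_FORMATS = {"INT32": "<i", "INT64": "<q", "FLOAT": "<f", "DOUBLE": "<d"}


def encode_stat_value(physical_type: str, value) -> bytes:
    """Encode a min/max statistic: PLAIN layout, but without a length prefix
    for byte arrays."""
    if physical_type in _NUMERIC_FORMATS:
        return struct.pack(_NUMERIC_FORMATS[physical_type], value)
    if physical_type in ("BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"):
        return bytes(value)  # raw bytes only, no 4-byte length
    raise ValueError(f"unsupported physical type: {physical_type}")


def plain_encode_byte_array(value: bytes) -> bytes:
    """For contrast: PLAIN encoding of a BYTE_ARRAY in a data page, with the prefix."""
    return struct.pack("<I", len(value)) + value


print(encode_stat_value("INT32", -2).hex())
print(encode_stat_value("BYTE_ARRAY", b"abc"))
print(plain_encode_byte_array(b"abc"))
```

The prefix is redundant in `Statistics` because each thrift `binary` field already carries its own length.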

/**
@@ -547,6 +578,23 @@ struct RowGroup {
4: optional list<SortingColumn> sorting_columns
}

/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

/**
* Union to specify the order used for min, max, and sorting values in a column.
*
* Possible values are:
* * TypeDefinedOrder - the column uses the order defined by its logical or
* physical type (if there is no logical type).
*
* If the reader does not support the value of this union, min and max stats
* for this column should be ignored.
*/
union ColumnOrder {
1: TypeDefinedOrder TYPE_ORDER;
}

/**
* Description for file metadata
*/
@@ -576,5 +624,14 @@ struct FileMetaData {
* e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
**/
6: optional string created_by

/**
* Sort order used for each column in this file.
*
* If this list is not present, then the order for each column is assumed to
* be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
* as fixed or bytes should be ignored.
*/
7: optional list<ColumnOrder> column_orders;
}
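
The reader-side rules in the comments above (unsupported `ColumnOrder` values, and files with no `column_orders` list) could be sketched roughly as below. This is an interpretation under stated assumptions, not code from any Parquet reader: `column_orders` is modelled as an optional list of strings, `"TYPE_ORDER"` stands in for the union member, and the function name and return values are hypothetical.

```python
from typing import Optional, Sequence

TYPE_ORDER = "TYPE_ORDER"  # the only ColumnOrder member defined in this commit
BINARY_TYPES = ("BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY")


def usable_stats_fields(
    column_orders: Optional[Sequence[str]],
    column_index: int,
    logical_type: Optional[str],
    physical_type: str,
) -> Optional[str]:
    """Return which Statistics fields to consult, or None to ignore min/max."""
    if column_orders is None:
        # Older file: values were produced with signed comparison. Stats for
        # INTERVAL or DECIMAL stored as fixed or bytes should be ignored.
        if logical_type in ("INTERVAL", "DECIMAL") and physical_type in BINARY_TYPES:
            return None
        return "min/max"
    if column_orders[column_index] == TYPE_ORDER:
        return "min_value/max_value"
    # The reader does not support this order, so the stats are not trusted.
    return None


print(usable_stats_fields(None, 0, "DECIMAL", "FIXED_LEN_BYTE_ARRAY"))
print(usable_stats_fields(None, 0, None, "INT32"))
print(usable_stats_fields([TYPE_ORDER], 0, "UTF8", "BYTE_ARRAY"))
print(usable_stats_fields(["SOME_FUTURE_ORDER"], 0, "UTF8", "BYTE_ARRAY"))
```

Returning `None` for an unrecognized order keeps older readers correct when new union members are added later: they skip statistics-based filtering for that column instead of misinterpreting the values.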
