GH-459: Add Variant logical type annotation #460

Merged
merged 5 commits into from
Nov 6, 2024
Add Variant as logical type
gene-db committed Oct 18, 2024
commit b15264f021f7264111cddb315a4d65c98735f9a4
17 changes: 17 additions & 0 deletions LogicalTypes.md
@@ -563,6 +563,23 @@ defined by the [BSON specification][bson-spec].

The sort order used for `BSON` is unsigned byte-wise comparison.

### VARIANT

`VARIANT` is used for a Variant value. It must annotate a group. The group must
contain a `binary` field named `metadata`, and a `binary` field named `value`.
The `VARIANT` annotated group can be used to store either an unshredded Variant
value, or a shredded Variant value.

* The top level must be a group annotated with `VARIANT` that contains a
`binary` field named `metadata`, and a `binary` field named `value`.
* Additional fields which start with `_` (underscore) can be ignored.

Contributor:
Why is this needed? None of the other types allow writing columns that should be ignored.

Contributor Author:
The intent was to leave room for additional (but redundant) metadata or values that we might store, while still allowing the group to be a valid Variant value.

Contributor:
I don't think that we want to add ignored columns. If we need to update the spec because something is missing, we should just do that directly instead of working around it with unspecified columns that only work in certain proprietary cases.

Contributor Author:
I see. I was worried that future evolution could break existing stored Variants, but simply adding a new field with optional or redundant semantics achieves the same compatibility story. This is removed.

* If `metadata` and `value` are the only fields in the group, then the group
is an unshredded Variant value. The `metadata` and `value` fields are
interpreted as an encoded Variant value as defined by the
[Variant binary encoding specification](VariantEncoding.md).
* If the group contains additional fields, it is a shredded Variant, and must
adhere to the scheme detailed in the [Variant shredding specification](VariantShredding.md).
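
The bullets above can be read as a small decision rule. The following is a minimal sketch of that rule, assuming a hypothetical representation in which a group is a mapping from field name to physical type name; `classify_variant_group` and its `ignore_underscore` parameter are illustrative names, not part of any Parquet library. It follows only the wording of this hunk, and the underscore rule it models is the one questioned in the review thread.

```python
from typing import Mapping

# Field names required by the VARIANT text in this hunk.
REQUIRED_FIELDS = ("metadata", "value")


def classify_variant_group(fields: Mapping[str, str],
                           ignore_underscore: bool = False) -> str:
    """Return "unshredded" or "shredded" for a VARIANT-annotated group.

    `fields` maps field name -> physical type name (hypothetical model).
    `ignore_underscore` models the bullet that lets readers ignore
    underscore-prefixed fields; it defaults to off.
    """
    for name in REQUIRED_FIELDS:
        if fields.get(name) != "binary":
            raise ValueError(f"VARIANT group needs a binary field named {name!r}")

    # Anything beyond `metadata` and `value` counts as an additional field.
    extra = [n for n in fields if n not in REQUIRED_FIELDS]
    if ignore_underscore:
        extra = [n for n in extra if not n.startswith("_")]

    return "shredded" if extra else "unshredded"
```

A group with only `metadata` and `value` classifies as unshredded; any other field makes it shredded unless it is underscore-prefixed and `ignore_underscore` is enabled.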

## Nested Types

This section specifies how `LIST` and `MAP` can be used to encode nested types
2 changes: 1 addition & 1 deletion VariantEncoding.md
@@ -42,7 +42,7 @@ This document describes the Variant Binary Encoding scheme.
[VariantShredding.md](VariantShredding.md) describes the details of the Variant shredding scheme.

# Variant in Parquet
-A Variant value in Parquet is represented by a group with 2 fields, named `value` and `metadata`.
+A Variant value in Parquet is represented by a group annotated with `VARIANT`, with 2 fields, named `value` and `metadata`.
Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.

# Metadata encoding
12 changes: 6 additions & 6 deletions VariantShredding.md
@@ -23,7 +23,7 @@
> **This specification is still under active development, and has not been formally adopted.**

The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values.
-Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet.
+Query engines encode each Variant value in a self-describing format, and store it as a `VARIANT` annotated group containing `value` and `metadata` binary fields in Parquet.
Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance.
We refer to this process as **shredding**.
Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file.
@@ -33,7 +33,7 @@ This document focuses on the shredding semantics, Parquet representation, implic
For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.

-At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
+At a high level, we replace the `value` field of the `VARIANT` annotated Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.

Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
@@ -58,7 +58,7 @@ An `variant_value` may also be populated if an object can be partially represent
The `metadata` column is unchanged from its unshredded representation, and may be referenced in `variant_value` fields in the shredded data.

```
-optional group variant_col {
+optional group variant_col (VARIANT) {
required binary metadata;
optional binary variant_value;
optional group object {
@@ -226,7 +226,7 @@ It contains an array of objects, containing an `a` field shredded as an array, a
The corresponding Parquet schema with “a” and “b” as leaf types is:

```
-optional group variant_col {
+optional group variant_col (VARIANT) {
required binary metadata;
optional binary variant_value;
optional group array (LIST) {
@@ -289,9 +289,9 @@ The second array element can be fully shredded, but the first and third cannot b

# Backward and forward compatibility

-Shredding is an optional feature of Variant, and readers must continue to be able to read a group containing only a `value` and `metadata` field.
+Shredding is an optional feature of Variant, and readers must continue to be able to read a `VARIANT` annotated group containing only a `value` and `metadata` field.

-Any fields in the same group as `typed_value`/`variant_value` that start with `_` (underscore) can be ignored.
+Any fields in the same `VARIANT` annotated group as `typed_value`/`variant_value` that start with `_` (underscore) can be ignored.
This is intended to allow future backwards-compatible extensions.
In particular, the field names `_metadata_key_paths` and any name starting with `_spark` are reserved, and should not be used by other implementations.
Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.
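
The compatibility rules in this hunk can be sketched as a field-name check. This is only an illustration under stated assumptions: `KNOWN_FIELDS` is assembled from the field names mentioned elsewhere in this diff and may not be exhaustive, and `check_extra_fields` and `is_reserved` are hypothetical helper names rather than functions from any Parquet implementation.

```python
from typing import Iterable, List

# Assumption: the field names this diff mentions for a VARIANT group.
KNOWN_FIELDS = {"metadata", "value", "object", "array",
                "typed_value", "variant_value"}

# Reserved names called out in the text above.
RESERVED_EXACT = {"_metadata_key_paths"}
RESERVED_PREFIX = "_spark"


def is_reserved(name: str) -> bool:
    """True if other implementations should not use this field name."""
    return name in RESERVED_EXACT or name.startswith(RESERVED_PREFIX)


def check_extra_fields(names: Iterable[str]) -> List[str]:
    """Return the fields a reader may ignore; fail on incompatible ones."""
    ignorable = []
    for name in names:
        if name in KNOWN_FIELDS:
            continue
        if name.startswith("_"):
            # Underscore-prefixed extras are treated as ignorable extensions.
            ignorable.append(name)
        else:
            # Any other unknown name is assumed backwards incompatible.
            raise ValueError(f"unsupported field in VARIANT group: {name!r}")
    return ignorable
```

A reader following this sketch skips `_spark_stats` but rejects a schema containing an unknown field such as `extra`.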
8 changes: 8 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -380,6 +380,12 @@ struct JsonType {
struct BsonType {
}

/**
* Embedded Variant logical type annotation
*/
struct VariantType {

Contributor:
Looks good to me.

}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -410,6 +416,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: VariantType VARIANT // no compatible ConvertedType
}
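
To make the effect of the new union member concrete, here is a small sketch of a lookup from `LogicalType` field id to name, using only the ids visible in this hunk. The table and `logical_type_name` are hypothetical and do not come from a generated Thrift binding; how a real reader treats an unrecognised union field depends on its implementation.

```python
from typing import Dict

# Field ids copied from the hunk above; the full union has more entries.
FIELD_IDS_BEFORE: Dict[int, str] = {13: "BSON", 14: "UUID", 15: "FLOAT16"}
FIELD_IDS_AFTER: Dict[int, str] = {**FIELD_IDS_BEFORE, 16: "VARIANT"}


def logical_type_name(field_id: int, table: Dict[int, str]) -> str:
    """Look up a LogicalType member by field id, or report it as unknown."""
    return table.get(field_id, "UNKNOWN")


# A reader built from the updated definition recognises id 16 ...
assert logical_type_name(16, FIELD_IDS_AFTER) == "VARIANT"
# ... while one built from the previous definition does not.
assert logical_type_name(16, FIELD_IDS_BEFORE) == "UNKNOWN"
```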

/**
@@ -980,6 +987,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* VARIANT - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true