[PROTOCOL RFC] Variant data type #2867

Merged (7 commits) on Apr 25, 2024
Update text
gene-db committed Apr 19, 2024
commit 3c3b432ec8f89f536d7a45c09a4f9bffa0552fe0
11 changes: 6 additions & 5 deletions protocol_rfcs/variant-type.md
@@ -44,7 +44,7 @@ To support this feature:

## Variant data in Parquet

-The Variant data type is represented as two binary encoded values, according to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
+The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
The two binary values are named `value` and `metadata`.

When writing Variant data to parquet files, the Variant data is written as a single Parquet struct, with the following fields:
@@ -57,7 +57,6 @@ metadata | binary | The binary-encoded Variant metadata, as described in [Varian
The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
Struct fields which start with `_` (underscore) can be safely ignored.
-The only non-ignorable fields must be `value` and `metadata`.
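
A minimal sketch of the field rules in this hunk, not the Delta implementation: the struct must carry `value` and `metadata`, and field names starting with `_` may be ignored. The helper name `check_variant_struct` and its return shape are hypothetical. Because this commit removes the sentence about the only non-ignorable fields, the sketch assumes nothing about other field names and only reports them.

```python
# Field names required by the RFC text for a Variant Parquet struct.
REQUIRED_FIELDS = ("value", "metadata")


def check_variant_struct(field_names):
    """Classify the field names of a Parquet struct that stores a Variant.

    Hypothetical helper for illustration only; it inspects names, not types.
    """
    names = list(field_names)

    # The struct must include both binary fields.
    missing = [f for f in REQUIRED_FIELDS if f not in names]
    if missing:
        raise ValueError(f"Variant struct is missing required fields: {missing}")

    # Fields starting with an underscore can be safely ignored.
    ignorable = [n for n in names if n.startswith("_")]

    # Assumption: anything else is only reported; the text shown here
    # does not say how such fields should be handled.
    other = [
        n for n in names
        if n not in REQUIRED_FIELDS and not n.startswith("_")
    ]
    return {"required": list(REQUIRED_FIELDS), "ignorable": ignorable, "other": other}
```

For example, `check_variant_struct(["value", "metadata", "_stats"])` reports `_stats` as ignorable, while a struct with only `value` raises `ValueError`.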

## Writer Requirements for Variant Data Type

@@ -68,9 +67,11 @@ When Variant type is supported (`writerFeatures` field of a table's `protocol` a
## Reader Requirements for Variant Data Type

When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
-- must be able to read the two parquet struct fields, `value` and `metadata` and interpret them as a Variant in concordance with the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
-  - It is recommended but not required for a Delta reader to treat the struct as a single indivisible Variant field, if the reader is used in an engine or other context that supports Variant.
-- can ignore any parquet struct field names starting with `_` (underscore)
+- must recognize and tolerate a `variant` data type in a Delta schema
+- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file
+- must make the column available to the engine:
+  - [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
+  - [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant.
Collaborator

Is it also an acceptable alternate to expose a string column by internally converting the struct-of-binary? That gives up all the benefits of the variant encoding, but maximizes compat with engines that don't allow users to load the library code that would interpret the struct-of-binary directly.

Contributor Author

Yeah, that should work. I added that as another alternate.
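
The reader-side choices discussed in this thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the RFC's or any engine's API: `ExposeMode`, `table_uses_variant`, `plan_variant_read`, and the `can_decode_variant` flag are hypothetical names, the protocol action is simplified to a plain dict, and preferring a string column over the raw struct when decoding is possible is an arbitrary choice made for this sketch.

```python
from enum import Enum


class ExposeMode(Enum):
    VARIANT = "variant"        # [Recommended] a single Variant field
    RAW_STRUCT = "raw_struct"  # [Alternate] the raw struct-of-binary
    STRING = "string"          # alternate raised in review: converted string


# Physical schema the reader must use when reading a Variant from file.
PHYSICAL_FIELDS = ("value", "metadata")


def table_uses_variant(protocol):
    """Return True if `readerFeatures` of the protocol dict lists `variantType`."""
    return "variantType" in (protocol.get("readerFeatures") or [])


def plan_variant_read(engine_supports_variant, can_decode_variant=False):
    """Pick how a Variant column is exposed to the engine.

    The physical schema is always the struct-of-binary; only the
    exposure differs. The ordering below is an assumption of this sketch.
    """
    if engine_supports_variant:
        mode = ExposeMode.VARIANT
    elif can_decode_variant:
        mode = ExposeMode.STRING
    else:
        mode = ExposeMode.RAW_STRUCT
    return PHYSICAL_FIELDS, mode
```

An engine without Variant support and without a decoder would call `plan_variant_read(False)` and get the raw struct-of-binary mode back, still reading the `value` and `metadata` fields from file.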


## Compatibility with other Delta Features
