-
Notifications
You must be signed in to change notification settings - Fork 459
GH-519: [Variant] Disambiguate SQL NULL (missing) from Variant null #520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
cashmand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, I agree that the use of null can be pretty ambiguous in this doc.
As I mentioned in a comment, I think it's useful to view this doc as a spec for translating a (metadata, value) pair (as described in VariantEncoding.md) into a shredded Parquet file (hopefully with concepts that translate fairly easily to other formats if desired), and back. So from that perspective, I think we should be using NULL to refer to the relevant definition level in Parquet, and null to refer to the original encoding in value, or the encoding in a shredded value field. If that's the case, I don't think we'd ever need to talk about a NULL value appearing in the Variant, except for the case where the entire (metadata, value) pair to be translated is missing/NULL, which I think is more of a parquet/engine concept, although we can mention it here if you think there's ambiguity about how it should be handled.
VariantShredding.md
Outdated
|
|
||
| Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null. | ||
| Unless the value is shredded as an object (see [Objects](#objects)), exactly one of `typed_value` or `value` must be present in each row. | ||
| Missing values in an optional group are encoded as missing values of the entire group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this sentence a bit hard to make sense of. I think this sentence is really a comment on the parquet-level logical Variant value, where I think NULL instead of missing would be more consistent with terminology used in the rest of parquet. Maybe we could use an example with a full parquet schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried rewording a bit, see what you think?
(If we still need a full schema we could do that)
VariantShredding.md
Outdated
|
|
||
| If the value is not an array, `typed_value` must be null. | ||
| If the value is an array, `value` must be null. | ||
| If the value is not an array, `typed_value` must be missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a matter of preference, so I'm fine with being overruled, but my preference is to use missing to refer to a value that is absent from the logical Variant value. E.g. in {"a": 1, "c": 3}, b is missing (even if present in the shredding spec). But when talking about encoding in Parquet, I think it makes more sense to use NULL, which I think is more consistent with the rest of the Parquet spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I was trying to follow the existing wording as closely as possible, and "missing" was used vastly more often than "NULL." But the real goal is clarity, not conformity to existing wording!
VariantShredding.md
Outdated
| The list `element` must be a required group. | ||
| The `element` group can contain `value` and `typed_value` fields. | ||
| The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` is not present or cannot represent it. | ||
| The element's `value` field stores the element as Variant-encoded `binary` when the element is missing or `typed_value` cannot represent it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the element is missing means here. I think the intent of the original wording was that if typed_value is not present in the shredding schema, then value is effectively required (although that's arguably a degenerate case of typed_value cannot represent it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, that makes sense
VariantShredding.md
Outdated
| That is, either `typed_value` or `value` (but not both) must be non-null. | ||
| Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null). | ||
| All elements of a variant array must be present (not missing) because the `array` Variant encoding does not allow missing elements. | ||
| That is, either `typed_value` or `value` (but not both) must be present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit (not on your change): I think (but not both) is wrong and can be removed. If the shredding schema is array-of-object, then the array elements can have both value and typed_value present, where value contains the object fields that do not exist in the array-of-object shredding schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should that be a separate issue/PR since it would be a "behavior change" of sorts?
Or did you want me to adjust the wording here?
VariantShredding.md
Outdated
| A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant. | ||
| To encode a field that is present with a null value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null). | ||
| A field's `value` and `typed_value` are both missing to indicate that the field does not exist in the variant. | ||
| To encode a field that is present with a SQL NULL value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intent of this section is about converting a Variant binary to shredded Variant, so it should be Variant null in both parts of the sentence, not SQL NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-examining this, I'm pretty sure the first references really is SQL NULL (both value and typed value are just plain missing), but that the second is indeed null?
VariantShredding.md
Outdated
| Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null). | ||
| All elements of a variant array must be present (not missing) because the `array` Variant encoding does not allow missing elements. | ||
| That is, either `typed_value` or `value` (but not both) must be present. | ||
| When converting an array with (SQL) nullable elements to variant, missing (SQL NULL) elements must be encoded in `value` as Variant null (`00`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my comment on object fields: I think the purpose of this VariantShredding.md doc is to provide a spec for translating a Variant binary (a metadata, value pair) into a shredded representation in parquet. In that context, it doesn't really make sense to talk about an array with (SQL) nullable elements. The array must already be a Variant array, and the Variant binary spec has no way to represent SQL NULL as an array element, only Variant null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Adjusted the wording a bit to hopefully capture this better.
VariantShredding.md
Outdated
| To encode a field that is present with a SQL NULL value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null). | ||
|
|
||
| When both `value` and `typed_value` for a field are non-null, engines should fail. | ||
| When both `value` and `typed_value` for a field are present, engines should fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to your comment on array elements... isn't it legal for both to be present if the object field is itself a partially shredded sub-object?
I wonder if the spec is trying to say that it's illegal for value (a variant object) and typed_value (a struct) to both mention the same field... but that was already clearly stated at L181-183 above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think this is wrong. I think it's trying to describe the case where typed_value is an array or scalar. I'm not sure how to word it succinctly and clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a stab at it, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better, thanks!
cashmand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
| # Data Skipping | ||
|
|
||
| Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing). | ||
| Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always NULL (missing). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Iceberg's current implementation also skips if all of the non-NULL values in value are null (i.e. 00), which can be determined from column stats. I'm not exactly sure what the intent of the data skipping rules here are; it seems like they require semantics for casting and comparing a Variant to a scalar, which the next few sentence seem to imply are not being defined here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @rdblue might have some insights here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Iceberg checks for both. This is trying to be open-ended and not prevent people from doing clever things if they are safe.
VariantShredding.md
Outdated
| If the value is not an object, `typed_value` must be null. | ||
| Readers can assume that a value is not an object if `typed_value` is null and that `typed_value` field values are correct; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields. | ||
| If the value is an object, `typed_value` must be present. | ||
| If the value is not an object, `typed_value` must be missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think NULL is clearer than missing here, since it's discussing the Parquet encoding.
|
|
||
| In order to avoid ambiguity, this specification always uses the term "`null`" to mean the variant | ||
| null value (binary encoding: `00`). The phrase "missing" or "NULL" (all caps) always refers to an | ||
| `optional` value that is not present (= SQL NULL). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's valuable to add a SQL null concept. It's probably helpful to call out an encoded Variant null, but null is already a thing in Parquet so it would be best to use the Parquet concept instead of adding another null that comes with a lot of baggage (3-value comparison).
It's also fine to try to visually disambiguate between them, but this introduces a lot of changes just from moving from null to NULL, so I would prefer doing that in some other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you suggest? The current spec wording has been causing a lot of confusion for arrow implementors, and this is the best I could come up with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can certainly replace "SQL NULL" with "parquet NULL", as a starting point (to not reference SQL anywhere).
But I don't think that's enough to disambiguate parquet null from variant null when the doc uses "null" to refer to both, sometimes in the same sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parquet spec for nullability calls them "NULL", not "null":
Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.
IMO using "NULL" to mean parquet null and "null" to mean variant null would be the clearest -- because it's quite visually distinct.
... but unfortunately the logical types spec uses lowercase "null" exclusively when referring to parquet nulls (except when it uses "null" to refer to variant null).
So perhaps another solution is to use (lowercase) "null" to mean "parquet null" or "null" to mean "variant null", but that's less visually distinct.
A third possibility could be to always say "parquet null" or "variant null" to disambiguate, but that will become wordy fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the last option is preferable, the spec is meant to be clear, not concise, wordy over ambiguous is always better in my opinion.
| Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null. | ||
| Unless the value is shredded as an object (see [Objects](#objects)), exactly one of `typed_value` or `value` must be present in each row. | ||
|
|
||
| NULL values in an optional group are encoded in the usual way, with a definition level that excludes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is confusing because it draws a distinction between SQL NULL and how Parquet represents null values.
| The Parquet columns used to store variant metadata and values must be accessed by name, not by position. | ||
|
|
||
| In order to avoid ambiguity, this specification always uses the term "`null`" to mean the variant | ||
| null value (binary encoding: `00`). The phrase "missing" or "NULL" (all caps) always refers to an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the spec, "missing" is used to refer to an object field that is not present in the object. For example, the field "type" is missing in this object: {"id": 34}. I don't think it is correct to conflate missing with a SQL null. Missing means the field is not present in an object.
| The list `element` must be a required group. | ||
| The `element` group can contain `value` and `typed_value` fields. | ||
| The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` is not present or cannot represent it. | ||
| The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` field is missing or cannot represent it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was using "not present" instead of "missing" to call out that the column was not in the physical Parquet schema. A "missing" object field can still be physically present in the Parquet schema because the schema is based on the shredded fields across Variant values. Objects encode missing as a Parquet null value for both value and typed_value Parquet columns.
In addition, this doc attempts to use "field" when referring to an field of a Variant object and "column" to refer to a physical Parquet column or field in a Parquet group. It's fine to state what is absent, but for consistency I think it should be "column" here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec seems to have quite a collection of similarly-named concepts:
- [parquet] null vs. [Variant] null
- [parquet schema] field [in a group] vs. [Variant object] field
- [parquet schema] field vs. [parquet data] column
- missing [field of Variant object] vs. missing [column of parquet file]
Not sure the best way to clearly delineate them all without getting wordy...
| That is, either `typed_value` or `value` (but not both) must be non-null. | ||
| Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null). | ||
| All elements of a variant array must be present (not missing) because the `array` Variant arrays cannot encode missing (NULL) elements. | ||
| That is, at least one of `typed_value` or `value` must be present (possibly both, if the elements are partially shredded objects). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is correct to say "at least" one of the columns must be present and that both might be. This is a good change.
| The series of `tags` arrays `["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null` would be stored as: | ||
|
|
||
| | Array | `value` | `typed_value `| `typed_value...value` | `typed_value...typed_value` | | ||
| | Array | `value` | `typed_value `| `value...value` | `typed_value...typed_value` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was correct before. This uses the example schema above, which is helpful for understanding the structure:
optional group tags (VARIANT) {
required binary metadata;
optional binary value;
optional group typed_value (LIST) { # must be optional to allow a null list
repeated group list {
required group element { # shredded element
optional binary value;
optional binary typed_value (STRING);
}
}
}
}
The value column is binary and is used for any non-array. When the value is an array, typed_value is used. The ... is used as a placeholder to shorten typed_value.list.element.value and typed_value.list.element.typed_value.
| Readers can assume that a value is not an object if `typed_value` is null and that `typed_value` field values are correct; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields. | ||
| If the value is an object, `typed_value` must be present. | ||
| If the value is not an object, `typed_value` must be NULL. | ||
| Readers can assume that a value is not an object if `typed_value` is NULL and that `typed_value` field values are correct when present; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of "when present" here is not correct. The aim is to make it clear that readers can project just the shredded fields if other fields in the object are not needed. For instance, variant_get(var_col, '$.type', 'string') only needs to project typed_value.type.value (or validate all null from stats) and typed_value.type.typed_value. If both values are null then the reader can assume that the field is missing from the object.
|
|
||
| | Event object | `value` | `typed_value` | `typed_value.event_type.value` | `typed_value.event_type.typed_value` | `typed_value.event_ts.value` | `typed_value.event_ts.typed_value` | Notes | | ||
| |------------------------------------------------------------------------------------|-----------------------------------|---------------|--------------------------------|--------------------------------------|------------------------------|------------------------------------|--------------------------------------------------| | ||
| | `{"event_type": "noop", "event_ts": 1729794114937}` | null | non-null | null | `noop` | null | 1729794114937 | Fully shredded object | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original intent was to avoid quotes and to present just the string contents. We should probably remove the quotes around the date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh. Strings are normally presented with quotes to make clear they're strings (= data) and not just words? It's a toss-up whether the date should have quotes -- they're certainly not "just" strings, but most engines present them as strings to the user.
| Readers may always assume that data is written correctly and that only `value` or `typed_value` is defined. | ||
| As a result, reads when both `value` and `typed_value` are defined may be inconsistent with optimized reads that require only one of the columns. | ||
| Readers may always assume that shredded data is written correctly, and that only one of `value` or `typed_value` is present (unless the field is a partially shredded object). | ||
| In particular, if a reader determines, based on the shredding schema, that a query needs only one of the two columns, the reader is not required to validate the other column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a lot of discussion about this and compromised on the wording "engines should fail".
I think this update is okay since it fits the intent that readers can projection just one and move on, but I'd like the people from the original discussion to sign off on changing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous wording was contradictory... it said that engines should fail and then turned around and said that engines could assume the data are always correct and can use one column without even looking at the other first.
| Readers may always assume that shredded data is written correctly, and that only one of `value` or `typed_value` is present (unless the field is a partially shredded object). | ||
| In particular, if a reader determines, based on the shredding schema, that a query needs only one of the two columns, the reader is not required to validate the other column. | ||
| A reader that accesses both both `value` and `typed_value` columns should fail if they are both non-NULL and the value is not a partially shredded object. | ||
| If readers choose to tolerate such cases, then the `typed_value` column must be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be controversial, but it is the behavior in Iceberg.
| If a Variant is missing in a context where a value is required, readers must return a Variant null (`00`): basic type 0 (primitive) and physical type 0 (null). | ||
| For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null). | ||
| NOTE: If the `measurement` group were `optional` instead of `required`, then rows with missing | ||
| values (SQL NULL) would be encoded by the entire group having missing values for those rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not allowed by the spec, which is why it was not included. I also think it is obvious that if the group is null then the columns should be treated as null when interpreting the value.
| | "n/a" | `01 00` v1/empty | `13 6E 2F 61` (`n/a`) | NULL | | ||
| | 100 | `01 00` v1/empty | NULL | `100` | | ||
|
|
||
| If a Variant is missing in a context where a value is required, readers must return a Variant null (`00`): basic type 0 (primitive) and physical type 0 (null). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still relevant because it tells readers how to handle cases where a missing value turns up where it shouldn't.
Rationale for this change
In variant shredding, values which are missing or SQL NULL are very different from Variant
nullvalues.What changes are included in this PR?
Update spec wording to consistently distinguish between "missing" or "NULL" values vs. Variant null or
nullvalues.Do these changes have PoC implementations?
N/A