Conversation

@scovich scovich commented Aug 25, 2025

Rationale for this change

In variant shredding, values which are missing or SQL NULL are very different from Variant null values.

What changes are included in this PR?

Update spec wording to consistently distinguish between "missing" or "NULL" values vs. Variant null or null values.

Do these changes have PoC implementations?

N/A


@cashmand cashmand left a comment


Thanks for this, I agree that the use of null can be pretty ambiguous in this doc.

As I mentioned in a comment, I think it's useful to view this doc as a spec for translating a (metadata, value) pair (as described in VariantEncoding.md) into a shredded Parquet file (hopefully with concepts that translate fairly easily to other formats if desired), and back. So from that perspective, I think we should be using NULL to refer to the relevant definition level in Parquet, and null to refer to the original encoding in value, or the encoding in a shredded value field. If that's the case, I don't think we'd ever need to talk about a NULL value appearing in the Variant, except for the case where the entire (metadata, value) pair to be translated is missing/NULL. That case is more of a parquet/engine concept, although we can mention it here if you think there's ambiguity about how it should be handled.


Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null.
Unless the value is shredded as an object (see [Objects](#objects)), exactly one of `typed_value` or `value` must be present in each row.
Missing values in an optional group are encoded as missing values of the entire group.


I find this sentence a bit hard to make sense of. I think this sentence is really a comment on the parquet-level logical Variant value, where I think NULL instead of missing would be more consistent with terminology used in the rest of parquet. Maybe we could use an example with a full parquet schema.

@scovich (author):

I tried rewording a bit, see what you think?
(If we still need a full schema we could do that)
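For reference, a minimal full-schema sketch of the kind of example being suggested (hypothetical; it reuses the `measurement` name from the spec's example, and shredding `typed_value` as INT64 is just an illustrative choice):

```
optional group measurement (VARIANT) {
  required binary metadata;
  optional binary value;        # Variant-encoded binary; Variant null is the byte 00
  optional int64 typed_value;   # shredded representation (INT64 chosen for illustration)
}
```

The distinction under discussion is then that both `value` and `typed_value` being NULL at the Parquet level is different from `value` holding an encoded Variant null (`00`).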


If the value is not an array, `typed_value` must be null.
If the value is an array, `value` must be null.
If the value is not an array, `typed_value` must be missing.


This may be a matter of preference, so I'm fine with being overruled, but my preference is to use missing to refer to a value that is absent from the logical Variant value. E.g. in {"a": 1, "c": 3}, b is missing (even if present in the shredding spec). But when talking about encoding in Parquet, I think it makes more sense to use NULL, which I think is more consistent with the rest of the Parquet spec.

@scovich (author):

Ack. I was trying to follow the existing wording as closely as possible, and "missing" was used vastly more often than "NULL." But the real goal is clarity, not conformity to existing wording!

The list `element` must be a required group.
The `element` group can contain `value` and `typed_value` fields.
The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` is not present or cannot represent it.
The element's `value` field stores the element as Variant-encoded `binary` when the element is missing or `typed_value` cannot represent it.


I'm not sure what "the element is missing" means here. I think the intent of the original wording was that if typed_value is not present in the shredding schema, then value is effectively required (although that's arguably a degenerate case of "typed_value cannot represent it").

@scovich (author):

Gotcha, that makes sense

That is, either `typed_value` or `value` (but not both) must be non-null.
Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
All elements of a variant array must be present (not missing) because the `array` Variant encoding does not allow missing elements.
That is, either `typed_value` or `value` (but not both) must be present.
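As a concrete (hypothetical) sketch of the element rule quoted above, assuming elements shredded as strings, with Python `None` standing in for a Parquet NULL cell:

```python
# Sketch only, not a reference implementation.
VARIANT_NULL = bytes([0x00])  # basic type 0 (primitive), physical type 0 (null)

def shred_string_element(elem):
    """Map one Variant array element to (value, typed_value) cells."""
    if elem is None:
        # A Variant null element is a real encoded byte stored in `value`;
        # it can never be dropped, since Variant arrays have no missing elements.
        return (VARIANT_NULL, None)
    if isinstance(elem, str):
        # Representable by typed_value: store it there, leave `value` NULL.
        return (None, elem)
    # Anything else would fall back to Variant-encoded binary in `value`;
    # the actual Variant encoder is out of scope for this sketch.
    raise NotImplementedError("variant binary encoding not sketched here")
```

For example, `shred_string_element(None)` yields `(b"\x00", None)` while `shred_string_element("comedy")` yields `(None, "comedy")`.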


Nit (not on your change): I think (but not both) is wrong and can be removed. If the shredding schema is array-of-object, then the array elements can have both value and typed_value present, where value contains the object fields that do not exist in the array-of-object shredding schema.

@scovich (author):

Should that be a separate issue/PR since it would be a "behavior change" of sorts?
Or did you want me to adjust the wording here?

A field's `value` and `typed_value` are set to null (missing) to indicate that the field does not exist in the variant.
To encode a field that is present with a null value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
A field's `value` and `typed_value` are both missing to indicate that the field does not exist in the variant.
To encode a field that is present with a SQL NULL value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).


I think the intent of this section is about converting a Variant binary to shredded Variant, so it should be Variant null in both parts of the sentence, not SQL NULL.

@scovich (author):

Re-examining this, I'm pretty sure the first reference really is SQL NULL (both value and typed_value are just plain missing), but that the second is indeed Variant null?

Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
All elements of a variant array must be present (not missing) because the `array` Variant encoding does not allow missing elements.
That is, either `typed_value` or `value` (but not both) must be present.
When converting an array with (SQL) nullable elements to variant, missing (SQL NULL) elements must be encoded in `value` as Variant null (`00`).


Similar to my comment on object fields: I think the purpose of this VariantShredding.md doc is to provide a spec for translating a Variant binary (a metadata, value pair) into a shredded representation in parquet. In that context, it doesn't really make sense to talk about an array with (SQL) nullable elements. The array must already be a Variant array, and the Variant binary spec has no way to represent SQL NULL as an array element, only Variant null.

@scovich (author):

Got it. Adjusted the wording a bit to hopefully capture this better.

To encode a field that is present with a SQL NULL value, the `value` must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).

When both `value` and `typed_value` for a field are non-null, engines should fail.
When both `value` and `typed_value` for a field are present, engines should fail.
@scovich (author):

Similar to your comment on array elements... isn't it legal for both to be present if the object field is itself a partially shredded sub-object?

I wonder if the spec is trying to say that it's illegal for value (a variant object) and typed_value (a struct) to both mention the same field... but that was already clearly stated at L181-183 above?


Agreed, I think this is wrong. I think it's trying to describe the case where typed_value is an array or scalar. I'm not sure how to word it succinctly and clearly.

@scovich (author):

I took a stab at it, what do you think?


Much better, thanks!

@cashmand cashmand left a comment


Thanks, LGTM!

# Data Skipping

Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always NULL (missing).


I think Iceberg's current implementation also skips if all of the non-NULL values in value are null (i.e. 00), which can be determined from column stats. I'm not exactly sure what the intent of the data skipping rules here are; it seems like they require semantics for casting and comparing a Variant to a scalar, which the next few sentences seem to imply are not being defined here.

@scovich (author):

Perhaps @rdblue might have some insights here?

Contributor:

Yeah, Iceberg checks for both. This is trying to be open-ended and not prevent people from doing clever things if they are safe.

If the value is not an object, `typed_value` must be null.
Readers can assume that a value is not an object if `typed_value` is null and that `typed_value` field values are correct; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields.
If the value is an object, `typed_value` must be present.
If the value is not an object, `typed_value` must be missing.


I think NULL is clearer than missing here, since it's discussing the Parquet encoding.


In order to avoid ambiguity, this specification always uses the term "`null`" to mean the variant
null value (binary encoding: `00`). The phrase "missing" or "NULL" (all caps) always refers to an
`optional` value that is not present (= SQL NULL).
Contributor:

I don't think it's valuable to add a SQL null concept. It's probably helpful to call out an encoded Variant null, but null is already a thing in Parquet so it would be best to use the Parquet concept instead of adding another null that comes with a lot of baggage (3-value comparison).

It's also fine to try to visually disambiguate between them, but this introduces a lot of changes just from moving from null to NULL, so I would prefer doing that in some other way.

@scovich (author):

What would you suggest? The current spec wording has been causing a lot of confusion for arrow implementors, and this is the best I could come up with.

@scovich (author):

We can certainly replace "SQL NULL" with "parquet NULL", as a starting point (to not reference SQL anywhere).

But I don't think that's enough to disambiguate parquet null from variant null when the doc uses "null" to refer to both, sometimes in the same sentence.

@scovich (author):

The parquet spec for nullability calls them "NULL", not "null":

Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.

IMO using "NULL" to mean parquet null and "null" to mean variant null would be the clearest -- because it's quite visually distinct.

... but unfortunately the logical types spec uses lowercase "null" exclusively when referring to parquet nulls (except when it uses "null" to refer to variant null).

So perhaps another solution is to use plain (lowercase) "null" to mean "parquet null" and code-formatted "`null`" to mean "variant null", but that's less visually distinct.

A third possibility could be to always say "parquet null" or "variant null" to disambiguate, but that will become wordy fast.


I think the last option is preferable; the spec is meant to be clear, not concise, and wordy beats ambiguous in my opinion.

Unless the value is shredded as an object (see [Objects](#objects)), `typed_value` or `value` (but not both) must be non-null.
Unless the value is shredded as an object (see [Objects](#objects)), exactly one of `typed_value` or `value` must be present in each row.

NULL values in an optional group are encoded in the usual way, with a definition level that excludes
Contributor:

I think that this is confusing because it draws a distinction between SQL NULL and how Parquet represents null values.

The Parquet columns used to store variant metadata and values must be accessed by name, not by position.

In order to avoid ambiguity, this specification always uses the term "`null`" to mean the variant
null value (binary encoding: `00`). The phrase "missing" or "NULL" (all caps) always refers to an
Contributor:

In the spec, "missing" is used to refer to an object field that is not present in the object. For example, the field "type" is missing in this object: {"id": 34}. I don't think it is correct to conflate missing with a SQL null. Missing means the field is not present in an object.

The list `element` must be a required group.
The `element` group can contain `value` and `typed_value` fields.
The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` is not present or cannot represent it.
The element's `value` field stores the element as Variant-encoded `binary` when the `typed_value` field is missing or cannot represent it.
Contributor:

I think this was using "not present" instead of "missing" to call out that the column was not in the physical Parquet schema. A "missing" object field can still be physically present in the Parquet schema because the schema is based on the shredded fields across Variant values. Objects encode missing as a Parquet null value for both value and typed_value Parquet columns.

In addition, this doc attempts to use "field" when referring to a field of a Variant object and "column" to refer to a physical Parquet column or field in a Parquet group. It's fine to state what is absent, but for consistency I think it should be "column" here.

@scovich (author):

The spec seems to have quite a collection of similarly-named concepts:

  • [parquet] null vs. [Variant] null
  • [parquet schema] field [in a group] vs. [Variant object] field
  • [parquet schema] field vs. [parquet data] column
  • missing [field of Variant object] vs. missing [column of parquet file]

Not sure the best way to clearly delineate them all without getting wordy...

That is, either `typed_value` or `value` (but not both) must be non-null.
Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
All elements of a variant array must be present (not missing) because Variant arrays cannot encode missing (NULL) elements.
That is, at least one of `typed_value` or `value` must be present (possibly both, if the elements are partially shredded objects).
Contributor:

I think it is correct to say "at least" one of the columns must be present and that both might be. This is a good change.

The series of `tags` arrays `["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null` would be stored as:

| Array | `value` | `typed_value `| `typed_value...value` | `typed_value...typed_value` |
| Array | `value` | `typed_value `| `value...value` | `typed_value...typed_value` |
Contributor:

This was correct before. This uses the example schema above, which is helpful for understanding the structure:

optional group tags (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value (LIST) {   # must be optional to allow a null list
    repeated group list {
      required group element {          # shredded element
        optional binary value;
        optional binary typed_value (STRING);
      }
    }
  }
}

The value column is binary and is used for any non-array. When the value is an array, typed_value is used. The ... is used as a placeholder to shorten typed_value.list.element.value and typed_value.list.element.typed_value.

Readers can assume that a value is not an object if `typed_value` is null and that `typed_value` field values are correct; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields.
If the value is an object, `typed_value` must be present.
If the value is not an object, `typed_value` must be NULL.
Readers can assume that a value is not an object if `typed_value` is NULL and that `typed_value` field values are correct when present; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields.
Contributor:

The addition of "when present" here is not correct. The aim is to make it clear that readers can project just the shredded fields if other fields in the object are not needed. For instance, variant_get(var_col, '$.type', 'string') only needs to project typed_value.type.value (or validate all null from stats) and typed_value.type.typed_value. If both values are null then the reader can assume that the field is missing from the object.
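That projection rule could be sketched roughly as follows (hypothetical per-row cell values, with Python `None` standing in for a Parquet NULL):

```python
MISSING = object()  # sentinel: the field does not exist in the object

def project_shredded_field(value_cell, typed_value_cell):
    """Resolve one shredded object field for one row, per the rule above."""
    if typed_value_cell is not None:
        return typed_value_cell   # shredded value; `value` need not be consulted
    if value_cell is not None:
        return value_cell         # fall back to Variant-encoded binary
    return MISSING                # both NULL: the field is missing from the object
```

For example, projecting `event_type` from the fully shredded row below returns the shredded string; a row where both cells are NULL resolves to `MISSING`.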


| Event object | `value` | `typed_value` | `typed_value.event_type.value` | `typed_value.event_type.typed_value` | `typed_value.event_ts.value` | `typed_value.event_ts.typed_value` | Notes |
|------------------------------------------------------------------------------------|-----------------------------------|---------------|--------------------------------|--------------------------------------|------------------------------|------------------------------------|--------------------------------------------------|
| `{"event_type": "noop", "event_ts": 1729794114937}` | null | non-null | null | `noop` | null | 1729794114937 | Fully shredded object |
Contributor:

I think the original intent was to avoid quotes and to present just the string contents. We should probably remove the quotes around the date.

@scovich (author):

Huh. Strings are normally presented with quotes to make clear they're strings (= data) and not just words? It's a toss-up whether the date should have quotes -- they're certainly not "just" strings, but most engines present them as strings to the user.

Readers may always assume that data is written correctly and that only `value` or `typed_value` is defined.
As a result, reads when both `value` and `typed_value` are defined may be inconsistent with optimized reads that require only one of the columns.
Readers may always assume that shredded data is written correctly, and that only one of `value` or `typed_value` is present (unless the field is a partially shredded object).
In particular, if a reader determines, based on the shredding schema, that a query needs only one of the two columns, the reader is not required to validate the other column.
@rdblue rdblue Aug 27, 2025


We had a lot of discussion about this and compromised on the wording "engines should fail".

I think this update is okay since it fits the intent that readers can project just one and move on, but I'd like the people from the original discussion to sign off on changing it.

@scovich (author):

The previous wording was contradictory... it said that engines should fail and then turned around and said that engines could assume the data are always correct and can use one column without even looking at the other first.

Readers may always assume that shredded data is written correctly, and that only one of `value` or `typed_value` is present (unless the field is a partially shredded object).
In particular, if a reader determines, based on the shredding schema, that a query needs only one of the two columns, the reader is not required to validate the other column.
A reader that accesses both `value` and `typed_value` columns should fail if they are both non-NULL and the value is not a partially shredded object.
If readers choose to tolerate such cases, then the `typed_value` column must be used.
Contributor:

This may be controversial, but it is the behavior in Iceberg.

If a Variant is missing in a context where a value is required, readers must return a Variant null (`00`): basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).
NOTE: If the `measurement` group were `optional` instead of `required`, then rows with missing
values (SQL NULL) would be encoded by the entire group having missing values for those rows.
Contributor:

This is not allowed by the spec, which is why it was not included. I also think it is obvious that if the group is null then the columns should be treated as null when interpreting the value.

| "n/a" | `01 00` v1/empty | `13 6E 2F 61` (`n/a`) | NULL |
| 100 | `01 00` v1/empty | NULL | `100` |

If a Variant is missing in a context where a value is required, readers must return a Variant null (`00`): basic type 0 (primitive) and physical type 0 (null).
Contributor:

This is still relevant because it tells readers how to handle cases where a missing value turns up where it shouldn't.
