Update spec for v2 changes #2654
Conversation
@yyanyy, @aokolnychyi, and @openinx, can you take a look at these spec changes? This documents most of the remaining changes for v2.
> | v1 | v2 | Field | Description |
> |----|----|-------|-------------|
> | _required_ | | **`partition-spec`** | The table's current partition spec, stored as only fields. Note that this is used by writers to partition data, but is not used when reading because reads use the specs stored in manifest files. (**Deprecated**: use `partition-specs` and `default-spec-id` instead) |
> | _optional_ | _required_ | **`partition-specs`** | A list of partition specs, stored as full partition spec objects. |
> | _optional_ | _required_ | **`default-spec-id`** | ID of the "current" spec that writers should use by default. |
nit: a little bit inconsistent here: `current-schema-id` vs. `default-spec-id`. I guess we can't rename `default-spec-id` for compatibility reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This distinction is on purpose. A table can have only one schema and that is the current one. We track old schemas for older snapshots. But a table can have multiple valid partition specs and it is fine to write new data into either one. That's why we track a "default" spec to use when writing if you aren't doing something that overrides it like migrating data from one spec to another.
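To make that split concrete, here is a minimal sketch of the relevant slice of v2 table metadata, written as a Python dict that mirrors the JSON layout; the table, field IDs, names, and transforms are hypothetical.

```python
# Sketch only: a fragment of v2 table metadata as a Python dict mirroring the
# JSON serialization. All IDs, names, and transforms below are made up.
table_metadata_fragment = {
    "format-version": 2,
    # Every spec that data in the table may still be partitioned by.
    "partition-specs": [
        {
            "spec-id": 0,
            "fields": [
                {"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"},
            ],
        },
        {
            "spec-id": 1,
            "fields": [
                {"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"},
                {"source-id": 1, "field-id": 1001, "name": "id_bucket", "transform": "bucket[16]"},
            ],
        },
    ],
    # Writers default to spec 1; rewriting data that still uses spec 0 is valid.
    "default-spec-id": 1,
}
```

Readers do not consult this list to interpret data files, since each manifest stores the spec its files were written with; the list and the default exist for writers and for metadata.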
site/docs/spec.md (Outdated)

> Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
>
> Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.
This seems a bit controversial as columns in a primary key are never nullable (even though identity fields are not technically a primary key) and null is never equal to null in SQL.
Would it be reasonable to follow the SQL standard and require that the identity columns, as well as columns used in equality deletes, always be not null? My worry is that it is going to be tricky if we interpret this differently from SQL. What if I upsert by (1, 'a', null)? I think that row should match nothing from the SQL perspective.
I'll change it so that only required columns are allowed. FYI @jackye1995, we'll need to update the implementation.
I'm going to remove this. My original logic was that the feature is limited if we can only add required fields, but that is only because we don't currently support default values. If we supported default values, then I think we would choose to avoid the ambiguity and only allow required columns. Since we plan to add default values, I think it makes less sense to work around it by allowing optional fields here.
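For illustration, assuming the outcome described above (identifier fields limited to required columns), here is a hypothetical schema sketch in the same dict-mirroring-JSON style; field IDs and names are made up.

```python
# Sketch only: a v2 schema with identifier fields, as a Python dict mirroring
# the JSON serialization. Field IDs and names are hypothetical.
schema = {
    "type": "struct",
    "schema-id": 0,
    # Rows are identified by (id, region); both fields are required.
    "identifier-field-ids": [1, 2],
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "region", "required": True, "type": "string"},
        # Optional columns stay in the schema but cannot be listed in
        # identifier-field-ids under the rule discussed above.
        {"id": 3, "name": "note", "required": False, "type": "string"},
    ],
}
```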
I assume that we treat `null` and undefined in the same way (e.g. map value is null vs. map key is not present)? Seems like we might want to clarify that behavior at this point. Also not sure if there should be special consideration for other values like `NaN` or +/- infinity.
Good point about NaN. I think we should consider all NaN values equal, even signaling NaNs.
For optional, I think we will probably remove this because it causes problems with SQL booleans. That limits use to only existing required columns until fields can have default values, but it is probably worth it to avoid SQL boolean confusion.
I'm going to update this so that float and double columns aren't allowed. I think there are too many issues if we were to allow them, and implementations would probably have bugs handling signaling NaN values and -0. We already don't allow floating points in some places where they make little sense (like bucket partitioning), so I think being conservative here is the right call.
Also, I doubt there are many valid use cases for having a floating point value in a row identifier.
And while I'm updating this, I'm also going to disallow float and double columns used for delete columns in equality delete files because of the same issues with signaling NaN and -0 values.
I agree it makes sense to exclude types that don't make sense: float, double, and possibly even boolean (since you could effectively only have three records: true, false, and null).
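To show where that restriction lands, here is a hypothetical sketch of an equality delete file entry, a few of the manifest `data_file` fields as a Python dict; the path, counts, and IDs are made up.

```python
# Sketch only: selected fields of a manifest entry's data_file struct for an
# equality delete file. Path, counts, and IDs are hypothetical.
equality_delete_file = {
    "content": 2,  # 0 = data, 1 = position deletes, 2 = equality deletes
    "file_path": "s3://bucket/table/data/deletes-0001.parquet",
    "file_format": "PARQUET",
    "record_count": 10,
    "file_size_in_bytes": 2048,
    # Field IDs the deletes match on; per the discussion above these should be
    # required columns and not float or double.
    "equality_ids": [1, 2],
}
```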
> In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation. This assignment caused problems when reading metadata tables based on manifest files from multiple specs because partition fields with the same ID may contain different data types. For compatibility with old versions, the following rules are recommended for partition evolution in v1 tables:
>
> 1. Do not reorder partition fields
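As a hypothetical illustration of the problem these rules guard against, here are two v1 specs for the same table written as dicts in the same style; because v1 did not persist field IDs, both specs end up with a field ID 1000 that holds different types.

```python
# Sketch only: two hypothetical v1 partition specs for the same table. v1 did
# not persist field IDs, so the reference implementation assigned them from 1000.
spec_before = {
    "spec-id": 0,
    "fields": [
        # day(ts) partition values are stored as dates (int days from epoch).
        {"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"},
    ],
}
spec_after = {
    "spec-id": 1,
    "fields": [
        # Replacing the field reuses field-id 1000 for string values, which is
        # what breaks metadata tables that combine partition tuples across specs.
        {"source-id": 2, "field-id": 1000, "name": "region", "transform": "identity"},
    ],
}
```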
Good summary, thanks for doing this!
This looks great to me, but optional identity columns do not seem right. Do we have any systems where that's possible?
I'm okay removing the ability to use optional columns. That's what @jackye1995 had originally and it aligns better with the SQL spec.
Trying to remember why we decided to make it optional: my current thinking is that it does not hurt to make it optional. If the engine enforces the primary key to be not nullable, it will fail in that engine during CREATE TABLE anyway, just like what we see in Flink.
I think my confusion comes from the fact that PK means unique and not null, and that having null in any column of the tuple we upsert by means that tuple matches nothing from the SQL perspective. It feels easier to assume that the columns we use in equality deletes are not nullable. I agree about the downside of not being able to use optional columns added later as identity columns, but I am not sure how important that is. We can probably cover that use case with the default values that LinkedIn is working on.
LGTM. Thanks, @rdblue!
Okay, I will add the patch to disallow optional, float, and double columns in the afternoon.
Thanks, @jackye1995! Sorry, that was my mistake in the first place.
This updates the spec for v2 changes that are in master or pull requests:

- `identifier-field-ids` to schema (Core: add row key to format v2 #2354)
- `schemas` and `current-schema-id` to table metadata (Spec: Add schema IDs, schema list, and associate each Snapshot with a schema #1029, Core: add schema id and schemas to table metadata #2096)
- `key_metadata` to manifest lists (Core: add key_metadata in ManifestFile #2520)
- `schema-id` to snapshot metadata (Core: add schema id to snapshot #2275)

Closes #1141.
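Tying the listed additions together, here is a hypothetical fragment of v2 table metadata in the same dict-mirroring-JSON style (illustrative values only) showing `schemas`, `current-schema-id`, `identifier-field-ids`, and a snapshot's `schema-id`; `key_metadata` is not shown because it is added to manifest lists rather than to the table metadata file.

```python
# Sketch only: a fragment of v2 table metadata combining the new fields listed
# above. IDs, paths, and timestamps are made up.
table_metadata = {
    "format-version": 2,
    "schemas": [
        {
            "type": "struct",
            "schema-id": 0,
            "identifier-field-ids": [1],
            "fields": [{"id": 1, "name": "id", "required": True, "type": "long"}],
        },
    ],
    "current-schema-id": 0,
    "snapshots": [
        {
            "snapshot-id": 3051729675574597004,
            "sequence-number": 1,
            "timestamp-ms": 1620000000000,
            "summary": {"operation": "append"},
            "manifest-list": "s3://bucket/table/metadata/snap-3051729675574597004.avro",
            # New in v2: which schema in the list above the snapshot was written with.
            "schema-id": 0,
        },
    ],
}
```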