Update spec for v2 changes #2654
Conversation
@yyanyy, @aokolnychyi, and @openinx, can you take a look at these spec changes? This documents most of the remaining changes for v2.
> | v1 | v2 | Field | Description |
> |----|----|-------|-------------|
> | _required_ | | **`partition-spec`** | The table's current partition spec, stored as only fields. Note that this is used by writers to partition data, but is not used when reading because reads use the specs stored in manifest files. (**Deprecated**: use `partition-specs` and `default-spec-id` instead) |
> | _optional_ | _required_ | **`partition-specs`** | A list of partition specs, stored as full partition spec objects. |
> | _optional_ | _required_ | **`default-spec-id`** | ID of the "current" spec that writers should use by default. |
nit: a little bit inconsistent here: `current-schema-id` vs. `default-spec-id`. I guess we can't rename `default-spec-id` for compatibility reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This distinction is on purpose. A table can have only one schema and that is the current one. We track old schemas for older snapshots. But a table can have multiple valid partition specs and it is fine to write new data into either one. That's why we track a "default" spec to use when writing if you aren't doing something that overrides it like migrating data from one spec to another.
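To make that split concrete, here is a minimal sketch of the relevant slice of v2 table metadata, written as a Python dict that mirrors the JSON layout; the table, field IDs, names, and transforms are hypothetical.

```python
# Sketch only: a fragment of v2 table metadata as a Python dict mirroring the
# JSON serialization. All IDs, names, and transforms below are made up.
table_metadata_fragment = {
    "format-version": 2,
    # Every spec that data in the table may still be partitioned by.
    "partition-specs": [
        {
            "spec-id": 0,
            "fields": [
                {"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"},
            ],
        },
        {
            "spec-id": 1,
            "fields": [
                {"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"},
                {"source-id": 1, "field-id": 1001, "name": "id_bucket", "transform": "bucket[16]"},
            ],
        },
    ],
    # Writers default to spec 1; rewriting data that still uses spec 0 is valid.
    "default-spec-id": 1,
}
```

Readers do not consult this list to interpret data files, since each manifest stores the spec its files were written with; the list and the default exist for writers and for metadata.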
site/docs/spec.md (Outdated)

> Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
>
> Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.
This seems a bit controversial as columns in a primary key are never nullable (even though identity fields are not technically a primary key) and null is never equal to null in SQL.
Would it be reasonable to follow the SQL standard and require that the identity columns, as well as columns used in equality deletes, always be not null? My worry is that it is going to be tricky if we interpret this differently from SQL. What if I upsert by (1, 'a', null)? I think that row should match nothing from the SQL perspective.
I'll change it so that only required columns are allowed. FYI @jackye1995, we'll need to update the implementation.
I'm going to remove this. My original logic was that the feature is limited if we can only add required fields, but that is only because we don't currently support default values. If we supported default values, then I think we would choose to avoid the ambiguity and only allow required columns. Since we plan to add default values, I think it makes less sense to work around it by allowing optional fields here.
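For illustration, assuming the outcome described above (identifier fields limited to required columns), here is a hypothetical schema sketch in the same dict-mirroring-JSON style; field IDs and names are made up.

```python
# Sketch only: a v2 schema with identifier fields, as a Python dict mirroring
# the JSON serialization. Field IDs and names are hypothetical.
schema = {
    "type": "struct",
    "schema-id": 0,
    # Rows are identified by (id, region); both fields are required.
    "identifier-field-ids": [1, 2],
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "region", "required": True, "type": "string"},
        # Optional columns stay in the schema but cannot be listed in
        # identifier-field-ids under the rule discussed above.
        {"id": 3, "name": "note", "required": False, "type": "string"},
    ],
}
```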
I assume that we treat `null` and undefined in the same way (e.g. map value is null vs. map key is not present)? Seems like we might want to clarify that behavior at this point. Also not sure if there should be special consideration for other values like `NaN` or +/- infinity.
Good point about NaN. I think we should consider all NaN values equal, even signaling NaNs.
For optional, I think we will probably remove this because it causes problems with SQL booleans. That limits use to only existing required columns until fields can have default values, but it is probably worth it to avoid SQL boolean confusion.
I'm going to update this so that float and double columns aren't allowed. I think there are too many issues if we were to allow them, and implementations would probably have bugs handling signaling NaN values and -0. We already don't allow floating points in some places where they make little sense (like bucket partitioning), so I think being conservative here is the right call.
Also, I doubt there are many valid use cases for having a floating point value in a row identifier.
And while I'm updating this, I'm also going to disallow float and double columns used for delete columns in equality delete files because of the same issues with signaling NaN and -0 values.
I agree it makes sense to exclude types that don't make sense: float, double, and possibly even boolean (since you could effectively only have three records: true, false, and null).
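To show where that restriction lands, here is a hypothetical sketch of an equality delete file entry, a few of the manifest `data_file` fields as a Python dict; the path, counts, and IDs are made up.

```python
# Sketch only: selected fields of a manifest entry's data_file struct for an
# equality delete file. Path, counts, and IDs are hypothetical.
equality_delete_file = {
    "content": 2,  # 0 = data, 1 = position deletes, 2 = equality deletes
    "file_path": "s3://bucket/table/data/deletes-0001.parquet",
    "file_format": "PARQUET",
    "record_count": 10,
    "file_size_in_bytes": 2048,
    # Field IDs the deletes match on; per the discussion above these should be
    # required columns and not float or double.
    "equality_ids": [1, 2],
}
```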
> In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation. This assignment caused problems when reading metadata tables based on manifest files from multiple specs because partition fields with the same ID may contain different data types. For compatibility with old versions, the following rules are recommended for partition evolution in v1 tables:
>
> 1. Do not reorder partition fields
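As a hypothetical illustration of the problem these rules guard against, here are two v1 specs for the same table written as dicts in the same style; because v1 did not persist field IDs, both specs end up with a field ID 1000 that holds different types.

```python
# Sketch only: two hypothetical v1 partition specs for the same table. v1 did
# not persist field IDs, so the reference implementation assigned them from 1000.
spec_before = {
    "spec-id": 0,
    "fields": [
        # day(ts) partition values are stored as dates (int days from epoch).
        {"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"},
    ],
}
spec_after = {
    "spec-id": 1,
    "fields": [
        # Replacing the field reuses field-id 1000 for string values, which is
        # what breaks metadata tables that combine partition tuples across specs.
        {"source-id": 2, "field-id": 1000, "name": "region", "transform": "identity"},
    ],
}
```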
Good summary, thanks for doing this!
This looks great to me, but optional identity columns do not seem right. Do we have any systems where that's possible?
I'm okay removing the ability to use optional columns. That's what @jackye1995 had originally and it aligns better with the SQL spec.
Trying to remember why we decided to make it optional: my current thinking is that it does not hurt to make it optional. If the engine enforces the primary key to be not nullable, it will fail in that engine during CREATE TABLE anyway, just like what we see in Flink.
I think my confusion comes from the fact that PK means unique and not null, and that having null in any column of the tuple we upsert by means that tuple matches nothing from the SQL perspective. It feels easier to assume that the columns we use in equality deletes are not nullable. I agree about the downside of not being able to use optional columns added later as identity columns, but I am not sure how important that is. We can probably cover that use case with the default values that LinkedIn is working on.
LGTM. Thanks, @rdblue!
Okay, I will add the patch to disallow optional, float, and double columns in the afternoon.
Thanks, @jackye1995! Sorry, that was my mistake in the first place.
This updates the spec for v2 changes that are in master or pull requests:

- `identifier-field-ids` to schema (Core: add row key to format v2 #2354)
- `schemas` and `current-schema-id` to table metadata (Spec: Add schema IDs, schema list, and associate each Snapshot with a schema #1029, Core: add schema id and schemas to table metadata #2096)
- `key_metadata` to manifest lists (Core: add key_metadata in ManifestFile #2520)
- `schema-id` to snapshot metadata (Core: add schema id to snapshot #2275)

Closes #1141.
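Tying the listed additions together, here is a hypothetical fragment of v2 table metadata in the same dict-mirroring-JSON style (illustrative values only) showing `schemas`, `current-schema-id`, `identifier-field-ids`, and a snapshot's `schema-id`; `key_metadata` is not shown because it is added to manifest lists rather than to the table metadata file.

```python
# Sketch only: a fragment of v2 table metadata combining the new fields listed
# above. IDs, paths, and timestamps are made up.
table_metadata = {
    "format-version": 2,
    "schemas": [
        {
            "type": "struct",
            "schema-id": 0,
            "identifier-field-ids": [1],
            "fields": [{"id": 1, "name": "id", "required": True, "type": "long"}],
        },
    ],
    "current-schema-id": 0,
    "snapshots": [
        {
            "snapshot-id": 3051729675574597004,
            "sequence-number": 1,
            "timestamp-ms": 1620000000000,
            "summary": {"operation": "append"},
            "manifest-list": "s3://bucket/table/metadata/snap-3051729675574597004.avro",
            # New in v2: which schema in the list above the snapshot was written with.
            "schema-id": 0,
        },
    ],
}
```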