-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add lastAssignedFieldId to table metadata #2089
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the change! LGTM.
One question: do you intend to add lastAssignedFieldId
to the JSON file in another PR, or you don't think it's necessary (so that this change is mostly to make lastAssignedFieldId
available in table metadata class)? I thought we want to persist it in JSON file, and if that's the case, I'll echo my question from this PR:
Is this just to ensure implementation correctness per spec? As it seems that from the current implementation getting this from PartitionSpec::lastAssignedFieldId should be good enough, as there's no table operation to actually remove a partition field from specs?
@yyanyy Thanks for the review. It seems table metadata is a better place to hold this info than partition spec updater as it is an info for a table, which might be used by other code paths in the future. |
The last assigned partition field ID does need to be added to the spec and the table metadata JSON. It is possible to reuse IDs, but I don't think it is a good idea to do it. There is also no need to reuse them. I think that reuse could only cause problems. |
I just thought of an example where ID reuse could be bad. The Nessie catalog allows branching at the table metadata level by keeping different metadata file references. A Nessie branch could remove data for one spec and reuse the IDs, which would make compatibility with another branch using the old spec confusing. |
@rdblue Thanks for the comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one comment
@@ -288,6 +292,9 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) { | |||
schema, TableMetadata.INITIAL_SPEC_ID, node.get(PARTITION_SPEC))); | |||
} | |||
|
|||
int lastAssignedFieldId = Optional.ofNullable(JsonUtil.getIntOrNull(LAST_ASSIGNED_FIELD_ID, node)) | |||
.orElseGet(() -> specs.stream().mapToInt(PartitionSpec::lastAssignedFieldId).max().orElse(999)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want it to be a required field in v2 and fail validation if it's missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense as the V2 is not released yet. I will update it accordingly.
Will update the doc after the change is merged. |
@rdblue can you please take another look? thanks. |
@@ -371,6 +374,10 @@ public PartitionSpec spec(int id) { | |||
return specsById; | |||
} | |||
|
|||
int lastAssignedFieldId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this name is potentially confusing. It is the last assigned partition field ID, and "partition" is a key part of it. How about renaming it to lastAssignedPartitionId
instead? Same with the instance field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the field name in TableMetadata
.
In PartitionSpec
, there is also a lastAssignedFieldId
field. It seems to be fine as it is a field inside partition spec, which already indicates it is about partition.
@@ -288,6 +292,14 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) { | |||
schema, TableMetadata.INITIAL_SPEC_ID, node.get(PARTITION_SPEC))); | |||
} | |||
|
|||
int lastAssignedFieldId; | |||
if (formatVersion > 1) { | |||
lastAssignedFieldId = JsonUtil.getInt(LAST_ASSIGNED_FIELD_ID, node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use a precondition to check that the last assigned partition ID is present? That would result in a better error message if it is missing, like the other cases: last-assigned-partition-id must exist in format v2
@@ -91,6 +92,7 @@ private TableMetadataParser() { | |||
static final String PARTITION_SPEC = "partition-spec"; | |||
static final String PARTITION_SPECS = "partition-specs"; | |||
static final String DEFAULT_SPEC_ID = "default-spec-id"; | |||
static final String LAST_ASSIGNED_FIELD_ID = "last-assigned-field-Id"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be last-assigned-partition-id
.
@@ -708,15 +716,16 @@ public TableMetadata buildReplacement(Schema updatedSchema, PartitionSpec update | |||
|
|||
return new TableMetadata(null, formatVersion, uuid, newLocation, | |||
lastSequenceNumber, System.currentTimeMillis(), newLastColumnId.get(), freshSchema, | |||
specId, specListBuilder.build(), orderId, sortOrdersBuilder.build(), ImmutableMap.copyOf(newProperties), | |||
specId, specListBuilder.build(), Math.max(lastAssignedFieldId, freshSpec.lastAssignedFieldId()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to fix the implementation of freshSpec
because it doesn't accept a lastAssignedFieldId
or reuse existing partition field IDs. In fact, I think what would currently happen is that the spec would be created independent of the rest of the table metadata and could reuse existing IDs. That's not good, but we also don't want to reassign all of the IDs (though that would be less bad).
We need to create a new spec similar to how we use assignFreshIds
for the schema, which will reuse IDs for existing fields and only assign new IDs for new fields, starting at the last assigned ID.
Thank you, @jun-he! I think there are few main improvements needed:
We can do all 3 in this PR, or just the first one if you prefer to separate them. Just let me know. |
@rdblue thanks for the comments. Updated the PR with the renamed field. I will send separate PRs for item 2 and 3. Thanks. |
"location": "s3://bucket/test/location", | ||
"last-sequence-number": 34, | ||
"last-updated-ms": 1602638573590, | ||
"last-column-id": 3, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that this is last-column-id
and not last-assigned-column-id
. I think we should make the metadata property last-partition-id
instead of last-assigned-partition-id
to match. I'll merge this and we can fix it in a follow-up.
Thanks, @jun-he! |
Follow up #2083.
Add
lastAssignedFieldId
to table metadata.To keep it simple,
lastAssignedFieldId
across all partition specs will be computed at initialization without persisting to the JSON file.