Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add lastAssignedFieldId to table metadata #2089

Merged
merged 4 commits into from
Jan 30, 2021

Conversation

jun-he
Copy link
Collaborator

@jun-he jun-he commented Jan 14, 2021

Follow up #2083.
Add lastAssignedFieldId to table metadata.
To keep it simple, lastAssignedFieldId across all partition specs will be computed at initialization without persisting to the JSON file.

@github-actions github-actions bot added the core label Jan 14, 2021
Copy link
Contributor

@yyanyy yyanyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change! LGTM.

One question: do you intend to add lastAssignedFieldId to the JSON file in another PR, or you don't think it's necessary (so that this change is mostly to make lastAssignedFieldId available in table metadata class)? I thought we want to persist it in JSON file, and if that's the case, I'll echo my question from this PR:

Is this just to ensure implementation correctness per spec? As it seems that from the current implementation getting this from PartitionSpec::lastAssignedFieldId should be good enough, as there's no table operation to actually remove a partition field from specs?

@jun-he
Copy link
Collaborator Author

jun-he commented Jan 15, 2021

@yyanyy Thanks for the review.
I thought about adding it to metadata JSON file. It seems unnecessary and also adds additional complexity for backward compatibility and maintains its correctness. As the info is already included in specs, we can simply recompute it during initialization.

It seems table metadata is a better place to hold this info than partition spec updater as it is an info for a table, which might be used by other code paths in the future.

@rdblue
Copy link
Contributor

rdblue commented Jan 15, 2021

The last assigned partition field ID does need to be added to the spec and the table metadata JSON. It is possible to reuse IDs, but I don't think it is a good idea to do it. There is also no need to reuse them. I think that reuse could only cause problems.

@rdblue
Copy link
Contributor

rdblue commented Jan 15, 2021

I just thought of an example where ID reuse could be bad. The Nessie catalog allows branching at the table metadata level by keeping different metadata file references. A Nessie branch could remove data for one spec and reuse the IDs, which would make compatibility with another branch using the old spec confusing.

@jun-he
Copy link
Collaborator Author

jun-he commented Jan 16, 2021

@rdblue Thanks for the comment.
It makes sense to add the lastAssignedFieldId to metadata JSON when specs in metadata might be removed.
We already keep the lastAssignedFieldId in the spec JSON. I will add lastAssignedFieldId to table metadata JSON as well.

Copy link
Contributor

@yyanyy yyanyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one comment

@@ -288,6 +292,9 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) {
schema, TableMetadata.INITIAL_SPEC_ID, node.get(PARTITION_SPEC)));
}

int lastAssignedFieldId = Optional.ofNullable(JsonUtil.getIntOrNull(LAST_ASSIGNED_FIELD_ID, node))
.orElseGet(() -> specs.stream().mapToInt(PartitionSpec::lastAssignedFieldId).max().orElse(999));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want it to be a required field in v2 and fail validation if it's missing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense as the V2 is not released yet. I will update it accordingly.

@jun-he
Copy link
Collaborator Author

jun-he commented Jan 24, 2021

Will update the doc after the change is merged.

@jun-he
Copy link
Collaborator Author

jun-he commented Jan 27, 2021

@rdblue can you please take another look? thanks.

@@ -371,6 +374,10 @@ public PartitionSpec spec(int id) {
return specsById;
}

int lastAssignedFieldId() {
Copy link
Contributor

@rdblue rdblue Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name is potentially confusing. It is the last assigned partition field ID, and "partition" is a key part of it. How about renaming it to lastAssignedPartitionId instead? Same with the instance field.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the field name in TableMetadata.
In PartitionSpec, there is also a lastAssignedFieldId field. It seems to be fine as it is a field inside partition spec, which already indicates it is about partition.

@@ -288,6 +292,14 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) {
schema, TableMetadata.INITIAL_SPEC_ID, node.get(PARTITION_SPEC)));
}

int lastAssignedFieldId;
if (formatVersion > 1) {
lastAssignedFieldId = JsonUtil.getInt(LAST_ASSIGNED_FIELD_ID, node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use a precondition to check that the last assigned partition ID is present? That would result in a better error message if it is missing, like the other cases: last-assigned-partition-id must exist in format v2

@@ -91,6 +92,7 @@ private TableMetadataParser() {
static final String PARTITION_SPEC = "partition-spec";
static final String PARTITION_SPECS = "partition-specs";
static final String DEFAULT_SPEC_ID = "default-spec-id";
static final String LAST_ASSIGNED_FIELD_ID = "last-assigned-field-Id";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be last-assigned-partition-id.

@@ -708,15 +716,16 @@ public TableMetadata buildReplacement(Schema updatedSchema, PartitionSpec update

return new TableMetadata(null, formatVersion, uuid, newLocation,
lastSequenceNumber, System.currentTimeMillis(), newLastColumnId.get(), freshSchema,
specId, specListBuilder.build(), orderId, sortOrdersBuilder.build(), ImmutableMap.copyOf(newProperties),
specId, specListBuilder.build(), Math.max(lastAssignedFieldId, freshSpec.lastAssignedFieldId()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to fix the implementation of freshSpec because it doesn't accept a lastAssignedFieldId or reuse existing partition field IDs. In fact, I think what would currently happen is that the spec would be created independent of the rest of the table metadata and could reuse existing IDs. That's not good, but we also don't want to reassign all of the IDs (though that would be less bad).

We need to create a new spec similar to how we use assignFreshIds for the schema, which will reuse IDs for existing fields and only assign new IDs for new fields, starting at the last assigned ID.

@rdblue
Copy link
Contributor

rdblue commented Jan 29, 2021

Thank you, @jun-he! I think there are few main improvements needed:

  1. We should use lastAssignedPartitionId instead of lastAssignedFieldId in all cases to be more specific
  2. For a table replace, we need to reassign the partition field IDs and reuse any existing IDs for equivalent fields
  3. We need to update the spec doc

We can do all 3 in this PR, or just the first one if you prefer to separate them. Just let me know.

@jun-he
Copy link
Collaborator Author

jun-he commented Jan 30, 2021

@rdblue thanks for the comments. Updated the PR with the renamed field. I will send separate PRs for item 2 and 3. Thanks.

"location": "s3://bucket/test/location",
"last-sequence-number": 34,
"last-updated-ms": 1602638573590,
"last-column-id": 3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that this is last-column-id and not last-assigned-column-id. I think we should make the metadata property last-partition-id instead of last-assigned-partition-id to match. I'll merge this and we can fix it in a follow-up.

@rdblue rdblue merged commit b3e01b7 into apache:master Jan 30, 2021
@rdblue
Copy link
Contributor

rdblue commented Jan 30, 2021

Thanks, @jun-he!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants