Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: add TableMetadata update builder #2957

Closed
wants to merge 1 commit into from

Conversation

jackye1995
Copy link
Contributor

continuation of #2887, as discussed with @rdblue, introduce a table metadata builder to cleanup the code.

I am separating the builder to a different class to avoid making the TableMetadata class too long. I had to add a few package-private getters but I think that's a reasonable trade off.

If we agree on this implementation, I will also update tests to use the builder instead of the TableMetadata constructor.

@jackye1995 jackye1995 requested a review from rdblue August 11, 2021 05:42
@github-actions github-actions bot added the core label Aug 11, 2021
Preconditions.checkNotNull(lastColumnId, "Last column ID not assigned for table metadata");
Preconditions.checkArgument(schemas != null && !schemas.isEmpty(),
"Table metadata must have at least 1 schema");
Preconditions.checkNotNull(currentSchemaId, "Table metadata must have the current schema ID");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically we can calculate the schema, spec, partition field and sort order id using a max, but for all use cases I see, we always know the actual values of these IDs and can avoid the recomputation, so I just decide to throw exception if it is not set by the caller.

}

public TableMetadata upgradeToFormatVersion(int newFormatVersion) {
Preconditions.checkArgument(newFormatVersion <= SUPPORTED_TABLE_FORMAT_VERSION,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic here is moved to the builder. What we are doing here is simple precondition checks and then directly run the constructor logic with the new format version, so the logic in the original PR to call upgradeToFormatVersion is unnecessary, and can be centralized to the build logic.

lastSequenceNumber, lastUpdatedMillis, lastAssignedColumnId, currentSchemaId, schemas, defaultSpecId, specs,
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentVersionId,
snapshots, entries.build(), metadataEntries.build());
return TableMetadata.builder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of using the builder here instead of instantiating TableMetadata directly? Looks like it made the method too long! I don't think that this needs to use the builder because there isn't anything to validate or change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the benefit of a builder for this class would be using it to accumulate state as it is read, rather than keeping the local variables.

For example, rather than calling snapshots.add and then calling withSnapshots(snapshots), the builder could help by exposing a addSnapshot method. But before adding that, I think it's worth considering what the purpose of the builder is, and I don't think that it is to make parsing easier.

To me, the purpose of the builder is to make updating table metadata easier. That's what I initially introduced the high-level methods for, like TableMetadata.replaceCurrentSnapshot. I think a builder pattern works better for implementing those methods since you can make multiple changes in the same builder instead of building a new TableMetadata and calling a different high-level method on it.

So in the end, I don't think that I would update this just yet like I said in my first comment. And I would probably change the API of the builder in general. If you want to use a builder here, then maybe two builders are appropriate. One for building from existing table metadata and one for just constructing the TableMetadata objects.

import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

public class TableMetadataBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be public? I would prefer not making the builder public for now.

formatVersion, TableMetadata.SUPPORTED_TABLE_FORMAT_VERSION);
Preconditions.checkArgument(formatVersion >= baseFormatVersion,
"Cannot downgrade v%s table to v%s", baseFormatVersion, formatVersion);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally prefer to structure builders so that validation is done as soon as possible. That keeps build methods smaller. This could be moved into the withFormatVersion method.

@jackye1995 jackye1995 changed the title Core: introduce TableMetadataBuilder Core: add TableMetadata update builder Sep 13, 2021
@jackye1995
Copy link
Contributor Author

@rdblue thanks for the review! Based on the conversation, I try to first introduce the builder for update only. To indicate the builder is for update only, I used name TableMetadataUpdateBuilder instead of TableMetadataBuilder.

I was thinking about using the same builder for both update and new metadata, but there might be some unnecessary complications to mix those 2 together, because we need to set defaults for any null values when building new metadata, whereas we want to inherit whatever was in the base metadata without any defaults in an update. So my current stance is that we can introduce another builder later for new table metadata (when that is necessary), and evaluate if it is worth letting them sharing some common class or not.

For the feature of accumulating states, I am leaving it out of this PR for now. I tried to implement it, but in update builder we need to recreate a list anyway for whatever we need to overwrite from the old metadata, so the benefit for building the list inside of builder is not obvious compared to building it outside and then overwrite the whole list.

@rdblue
Copy link
Contributor

rdblue commented Sep 16, 2021

I was imagining the builder would work a little differently. There's definitely some value in the regular builder pattern where you can mostly copy fields and just make calls to change values. But I think adding a builder could bring more value if we actually moved the implementations of the TableMetadata modifications into the builder. So what I was thinking was to basically move all of the methods like replaceCurrentSnapshot and removeSnapshotsIf into the builder itself.

That approach allows us to make more than one change without adding special handling for upgradeToFormatVersion. We could chain changes to produce a new TableMetadata because the methods keep track of table state.

A second benefit that I'm interested in is tracking the high-level changes the led to a new TableMetadata object. We're building a REST-based catalog and I think this is the first step to an update that posts just what changed. For example, replaceCurrentSnapshot could produce a new TableMetadata, but also keep track of the facts that (1) a new snapshot was added and (2) the current snapshot was set to that snapshot's ID. Those granular changes make a REST protocol much more efficient.

Does that motivation make sense?

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if everything that needs to be updated is explicitly set via builder methods as right now it's a mix of updating things via builder methods and implicitly setting stuff in build() itself.
Also I'm not sure if in most cases withInputFile(null) is missing from the builder calls

lastSequenceNumber, lastUpdatedMillis, lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties,
currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
return builderFrom(this).generateUUID().build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this also have a .withFile(null)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's null by default, is it worth adding it explicitly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, not worth adding it explicitly. I was confused with baseFile and file

newSchemaId, builder.build(), defaultSpecId, updatedSpecs, lastAssignedPartitionId,
defaultSortOrderId, updatedSortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
addPreviousFile(file, lastUpdatedMillis));
return builderFrom(this)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have a .withFile(null) and .withPreviousFiles(addPreviousFile(file, lastUpdatedMillis))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous file is added by default from the base metadata during build, unless explicitly overwritten.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I know that this is being added during build() but IMO this is rather counter-intuitive as it's not clear what's being set explicitly and what's being set implicitly.

.withDefaultSpecId(newDefaultSpecId)
.withSpecs(builder.build())
.withLastAssignedPartitionId(Math.max(lastAssignedPartitionId, newPartitionSpec.lastAssignedFieldId()))
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit confusing when some stuff is set via builder methods and other stuff is set when build() is called (such as it's the case for previousFiles). IMO it would be more consistent when everything that needs to be set is set via builder methods. What could be still added to the build method is potential validation logic (if necessary)

lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, snapshot.snapshotId(), snapshots,
newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
return builderFrom(this)
.refreshLastUpdateMillis()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this actually need to be equal to nowMillis? because when we reach this part, the millis will be different for the SnapshotLogEntry and for TableMetadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, will update them to match

@nastra
Copy link
Contributor

nastra commented Sep 16, 2021

@rdblue independently from this PR, have you guys thought about using Immutable annotations on things that should be immutable, which also gives you make-a-copy-of-X-and-update-fields-a-b-c for free?

@jackye1995
Copy link
Contributor Author

@nastra I think we do want to have very defined behavior and easy customization to the builder, that's why I avoid using builder classes like the one you shared.

@rdblue this makes sense, I will update this based on your suggestion.

@jackye1995
Copy link
Contributor Author

close in favor of #3664

@jackye1995 jackye1995 closed this Dec 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants