-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: add TableMetadata update builder #2957
Conversation
Preconditions.checkNotNull(lastColumnId, "Last column ID not assigned for table metadata"); | ||
Preconditions.checkArgument(schemas != null && !schemas.isEmpty(), | ||
"Table metadata must have at least 1 schema"); | ||
Preconditions.checkNotNull(currentSchemaId, "Table metadata must have the current schema ID"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically we can calculate the schema, spec, partition field and sort order id using a max, but for all use cases I see, we always know the actual values of these IDs and can avoid the recomputation, so I just decide to throw exception if it is not set by the caller.
} | ||
|
||
public TableMetadata upgradeToFormatVersion(int newFormatVersion) { | ||
Preconditions.checkArgument(newFormatVersion <= SUPPORTED_TABLE_FORMAT_VERSION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic here is moved to the builder. What we are doing here is simple precondition checks and then directly run the constructor logic with the new format version, so the logic in the original PR to call upgradeToFormatVersion is unnecessary, and can be centralized to the build logic.
424796c
to
22df22e
Compare
lastSequenceNumber, lastUpdatedMillis, lastAssignedColumnId, currentSchemaId, schemas, defaultSpecId, specs, | ||
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentVersionId, | ||
snapshots, entries.build(), metadataEntries.build()); | ||
return TableMetadata.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of using the builder here instead of instantiating TableMetadata
directly? Looks like it made the method too long! I don't think that this needs to use the builder because there isn't anything to validate or change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the benefit of a builder for this class would be using it to accumulate state as it is read, rather than keeping the local variables.
For example, rather than calling snapshots.add
and then calling withSnapshots(snapshots)
, the builder could help by exposing a addSnapshot
method. But before adding that, I think it's worth considering what the purpose of the builder is, and I don't think that it is to make parsing easier.
To me, the purpose of the builder is to make updating table metadata easier. That's what I initially introduced the high-level methods for, like TableMetadata.replaceCurrentSnapshot
. I think a builder pattern works better for implementing those methods since you can make multiple changes in the same builder instead of building a new TableMetadata
and calling a different high-level method on it.
So in the end, I don't think that I would update this just yet like I said in my first comment. And I would probably change the API of the builder in general. If you want to use a builder here, then maybe two builders are appropriate. One for building from existing table metadata and one for just constructing the TableMetadata
objects.
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; | ||
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; | ||
|
||
public class TableMetadataBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be public? I would prefer not making the builder public for now.
formatVersion, TableMetadata.SUPPORTED_TABLE_FORMAT_VERSION); | ||
Preconditions.checkArgument(formatVersion >= baseFormatVersion, | ||
"Cannot downgrade v%s table to v%s", baseFormatVersion, formatVersion); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally prefer to structure builders so that validation is done as soon as possible. That keeps build
methods smaller. This could be moved into the withFormatVersion
method.
22df22e
to
601da70
Compare
@rdblue thanks for the review! Based on the conversation, I try to first introduce the builder for update only. To indicate the builder is for update only, I used name I was thinking about using the same builder for both update and new metadata, but there might be some unnecessary complications to mix those 2 together, because we need to set defaults for any null values when building new metadata, whereas we want to inherit whatever was in the base metadata without any defaults in an update. So my current stance is that we can introduce another builder later for new table metadata (when that is necessary), and evaluate if it is worth letting them sharing some common class or not. For the feature of accumulating states, I am leaving it out of this PR for now. I tried to implement it, but in update builder we need to recreate a list anyway for whatever we need to overwrite from the old metadata, so the benefit for building the list inside of builder is not obvious compared to building it outside and then overwrite the whole list. |
I was imagining the builder would work a little differently. There's definitely some value in the regular builder pattern where you can mostly copy fields and just make calls to change values. But I think adding a builder could bring more value if we actually moved the implementations of the That approach allows us to make more than one change without adding special handling for A second benefit that I'm interested in is tracking the high-level changes the led to a new Does that motivation make sense? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if everything that needs to be updated is explicitly set via builder methods as right now it's a mix of updating things via builder methods and implicitly setting stuff in build()
itself.
Also I'm not sure if in most cases withInputFile(null)
is missing from the builder calls
lastSequenceNumber, lastUpdatedMillis, lastColumnId, currentSchemaId, schemas, defaultSpecId, specs, | ||
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, | ||
currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis)); | ||
return builderFrom(this).generateUUID().build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this also have a .withFile(null)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's null by default, is it worth adding it explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, not worth adding it explicitly. I was confused with baseFile
and file
newSchemaId, builder.build(), defaultSpecId, updatedSpecs, lastAssignedPartitionId, | ||
defaultSortOrderId, updatedSortOrders, properties, currentSnapshotId, snapshots, snapshotLog, | ||
addPreviousFile(file, lastUpdatedMillis)); | ||
return builderFrom(this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this have a .withFile(null)
and .withPreviousFiles(addPreviousFile(file, lastUpdatedMillis))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous file is added by default from the base metadata during build, unless explicitly overwritten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I know that this is being added during build()
but IMO this is rather counter-intuitive as it's not clear what's being set explicitly and what's being set implicitly.
.withDefaultSpecId(newDefaultSpecId) | ||
.withSpecs(builder.build()) | ||
.withLastAssignedPartitionId(Math.max(lastAssignedPartitionId, newPartitionSpec.lastAssignedFieldId())) | ||
.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit confusing when some stuff is set via builder methods and other stuff is set when build()
is called (such as it's the case for previousFiles
). IMO it would be more consistent when everything that needs to be set is set via builder methods. What could be still added to the build
method is potential validation logic (if necessary)
lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, snapshot.snapshotId(), snapshots, | ||
newSnapshotLog, addPreviousFile(file, lastUpdatedMillis)); | ||
return builderFrom(this) | ||
.refreshLastUpdateMillis() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this actually need to be equal to nowMillis
? because when we reach this part, the millis will be different for the SnapshotLogEntry
and for TableMetadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, will update them to match
@rdblue independently from this PR, have you guys thought about using Immutable annotations on things that should be immutable, which also gives you |
close in favor of #3664 |
continuation of #2887, as discussed with @rdblue, introduce a table metadata builder to cleanup the code.
I am separating the builder to a different class to avoid making the
TableMetadata
class too long. I had to add a few package-private getters but I think that's a reasonable trade off.If we agree on this implementation, I will also update tests to use the builder instead of the
TableMetadata
constructor.