Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump Nessie to 0.28.0, improve commit conflict detection #4594

Merged
merged 1 commit into from
May 4, 2022

Conversation

snazy
Copy link
Member

@snazy snazy commented Apr 20, 2022

Enhances Iceberg commit conflict detection by maintaining the commit-id from which a
table metadata has been loaded, to use it as the "expected hash" in a Nessie commit.

Makes Nessie specific properties available in TableMetadata properties:

  • nessie.content.id - the Nessie Content.getId()
  • nessie.commit.id - the commit ID used to retrieve the table metadata
  • nessie.reference.name - the reference name from which the table metadata has been loaded

Also fixes an issue via org.apache.iceberg.nessie.NessieTableOperations#loadTableMetadata
that caused too many "previous files", because the TableMetadata.buildFrom() assumed that
it's only called for ongoing modifications.

@@ -103,13 +123,14 @@ protected void doRefresh() {
"is no longer valid.", client.getRef().getName()), e);
}
String metadataLocation = null;
Reference reference = client.getRef().getReference();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming these changes were done to not do additional API calls. These changes won't be necessary after #4593

Copy link
Member Author

@snazy snazy Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling Supplier.get() a couple of times isn't necessary either.

@@ -819,7 +827,7 @@ private Builder(TableMetadata base) {
this.startingChangeCount = changes.size();

this.snapshotLog = Lists.newArrayList(base.snapshotLog);
this.previousFileLocation = base.metadataFileLocation;
this.previousFileLocation = noPreviousFile ? null : base.metadataFileLocation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little bit annoying that this needs to be set, so I wonder whether it would be better to have a builder method that allows overriding this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to override that too for something I'm working on. But I have just used the builder's withMetadataLocation. In my case, the value of base.metadataFIleLocation is already null so just calling the function works fine.

@@ -173,7 +207,7 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
throw new CommitStateUnknownException(ex);
} catch (NessieNotFoundException ex) {
throw new RuntimeException(
String.format("Cannot commit: Reference '%s' no longer exists", client.getRef().getName()), ex);
String.format("Cannot commit: Reference '%s' no longer exists", updateableReference.getName()), ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would probably revert all of those changes as I don't think they are necessary

Copy link
Member Author

@snazy snazy Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling Supplier.get() a couple of times isn't necessary either.

@snazy snazy force-pushed the nessie-0.27.0 branch 2 times, most recently from b4c5b88 to 0fd33b4 Compare April 21, 2022 08:50
Comment on lines 100 to 96
TableMetadata.Builder builder = TableMetadata.buildFrom(deserialized, true)
.setCurrentSchema(table.getSchemaId())
.setDefaultSortOrder(table.getSortOrderId())
.setDefaultPartitionSpec(table.getSpecId())
.withMetadataLocation(metadataLocation);
.withMetadataLocation(metadataLocation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need to pass true in here. If base doesn't have that field set, it will just be null. Otherwise it will get overridden after this builder is called with .build().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that the buildFrom/Builder.<init>() sets the final field previousFileLocation, which is then used in Builder.build() to add a MetadataLogEntry. Builder.previousFileLocation and Builder.metadataLocation are different.

@nastra nastra requested a review from RussellSpitzer April 22, 2022 06:58
@snazy snazy force-pushed the nessie-0.27.0 branch 2 times, most recently from d36118d to bfc6655 Compare April 26, 2022 13:29
@snazy snazy changed the title Bump Nessie to 0.27.0, improve commit conflict detection Bump Nessie to 0.28.0, improve commit conflict detection Apr 26, 2022
@snazy snazy requested a review from nastra April 26, 2022 13:30
@@ -155,7 +155,7 @@ public static String toJson(TableMetadata metadata) {
}
}

private static void toJson(TableMetadata metadata, JsonGenerator generator) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to change accessibility on these methods, why is the custom JsonGenerator needed in the Nessie code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh
fair enough

TableMetadata.Builder builder = TableMetadata.buildFrom(TableMetadataParser.read(io(), metadataLocation))
TableMetadata deserialized;
if (table.getMetadata() != null) {
deserialized = TableMetadataParser.fromJson(io(), metadataLocation, table.getMetadata().getMetadata());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have this path in to request metadata out of a Nessie specific object which stores the metadata as a Json String that needs to be parsed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i see, the IcebergTable class has a JsonNode? And our current public method only accepts a string, it doesn't allow directly injecting the node. I kind of think we should probably stick to that "string" api rather that opening up another method to the public but I may be paranoid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not yet, but in the future, yes.)

It's already a JSON node tree (as a generic JsonNode) - so the text->nodes parsing already happened.
As I didn't want to serialize the JsonNode to a String to get a TableMetadata, I've updated the TableMetadataParser.

if (table.getMetadata() != null) {
deserialized = TableMetadataParser.fromJson(io(), metadataLocation, table.getMetadata().getMetadata());
} else {
deserialized = TableMetadataParser.read(io(), metadataLocation);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise we just read from the location on disk?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right

@@ -755,6 +755,10 @@ public static Builder buildFrom(TableMetadata base) {
return new Builder(base);
}

public static Builder buildFrom(TableMetadata base, boolean noPreviousFile) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new public api here needs a java doc, I'm not sure I follow what noPreviousFile is expected to do here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a "smart thing" in the builder. If it's present, then it add it as a history item. That behavior is okay, when buildFrom() is only used when it is used within a "modify TableMetadata operation". But the Nessie catalog only uses it to modify the TableMetadata as it is - so it's not a change, which requires a new history-item.
Without the noPreviousFile==true, a couple of catalog tests fail, because those assert on the number of history items.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snazy I've been thinking on this and I'm wondering why can't implement a builder like .withPreviousFileLocation to set that to previousFileLocation to null

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WFM, updated the PR

JsonNode newMetadata;
try {
TokenBuffer tokenBuffer = new TokenBuffer(JsonUtil.factory().getCodec(), false);
TableMetadataParser.toJson(metadata, tokenBuffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we could avoid this usage if we use the "toJson(metadata) method and then pass that into the token buffer. I know it's a bit more wasteful but we are only doing this once per operation and it let's us keep the toJson(metadata, generator) method package private. Again I may be being paranoid here but ideally we don't open up any more methods than we absolutely have to.

Enhances Iceberg commit conflict detection by maintaining the commit-id from which a
table metadata has been loaded, to use it as the "expected hash" in a Nessie commit.

Makes Nessie specific properties available in `TableMetadata` properties:
* `nessie.content.id` - the Nessie `Content.getId()`
* `nessie.commit.id` - the commit ID used to retrieve the table metadata
* `nessie.reference.name` - the reference name from which the table metadata has been loaded

Also fixes an issue via `org.apache.iceberg.nessie.NessieTableOperations#loadTableMetadata`
that caused too many "previous files", because the `TableMetadata.buildFrom()` assumed that
it's only called for ongoing modifications.
@snazy snazy force-pushed the nessie-0.27.0 branch from 55c3505 to fffb89f Compare May 2, 2022 12:28
@@ -173,7 +203,8 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
throw new CommitStateUnknownException(ex);
} catch (NessieNotFoundException ex) {
throw new RuntimeException(
String.format("Cannot commit: Reference '%s' no longer exists", client.getRef().getName()), ex);
String.format("Cannot commit: Reference '%s' no longer exists",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a formatting change

@@ -189,7 +220,8 @@ private boolean isSnapshotOperation(TableMetadata base, TableMetadata metadata)

private String buildCommitMsg(TableMetadata base, TableMetadata metadata) {
if (isSnapshotOperation(base, metadata)) {
return String.format("Iceberg %s against %s", metadata.currentSnapshot().operation(), tableName());
return String.format("Iceberg %s against %s", metadata.currentSnapshot().operation(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a formatting change

@RussellSpitzer RussellSpitzer merged commit eb1fd41 into apache:master May 4, 2022
@snazy snazy deleted the nessie-0.27.0 branch May 5, 2022 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants