Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Implement RollbackToSnapshotProcedure #1759

Merged
merged 4 commits into from
Nov 13, 2020

Conversation

aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Nov 12, 2020

This PR implements a procedure to rollback to a given snapshot id.

Resovles #1592.

@github-actions github-actions bot added the spark label Nov 12, 2020
Snapshot secondSnapshot = table.currentSnapshot();

List<Object[]> output = sql(
"CALL %s.system.rollback_to_snapshot(snapshot_id => %dL, namespace => '%s', table => '%s')",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is %dL required or could it be %d? I think because of the casting that was added, it could be %d.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not required, we have testRollbackToSnapshotWithQuotedIdentifiers for that. I don't think our cast logic is working there, though. I think Spark is smart enough to parse snapshot id as long since the value is too large to be int.

I'll add a test for the casting logic as soon as it applies.

return Identifier.of(namespace, nameParts.head());
}

private Seq<String> parseMultipartIdentifier(String identifierAsString) {
Copy link
Contributor

@rdblue rdblue Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in a private method, I don't think it is a good practice to use Scala Seq because of compatibility problems. It would be safer to pass this directly into a converter to get a List.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean change the return type to List?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Seq is not safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to return an array.

}

private interface ProcedureBuilder {
Procedure build(TableCatalog catalog);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange that the builder passes catalog to build, rather than withTableCatalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the overall approach or the method name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that there will be more options that will be passed to the builder, but there is just TableCatalog for now. I would expect this to support cases that don't need a TableCatalog, so it doesn't seem like something that we should tie to the build method. Instead, I'd use a withTableCatalog method and always build the procedure with build()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rework this part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Let me know what you think, @rdblue.

long snapshotId = args.getLong(2);

return modifyIcebergTable(namespace, tableName, table -> {
Snapshot previousSnapshot = table.currentSnapshot();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to reason about concurrent operations on this table instance if caching is enabled. Here, there is a possibility there will be an operation in between we got previousSnapshot and before we committed so the output may not be precise. It may be even more important in other procedures later.

@rdblue, thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a validation to set requiredCurrentSnapshotId in the operation. Right now, we always roll back as long as it is an ancestor of the current state. Can be done as a follow-up though.

@Override
public T build() {
return doBuild();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make build abstract?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants