-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core, Spark: Spark writes/actions should only perform cleanup if failure is cleanable #10373
Core, Spark: Spark writes/actions should only perform cleanup if failure is cleanable #10373
Conversation
A few tests are failing now due to always throwing CommitStateUnknownException; I'll step through and see which exceptions should be marked as cleanable and which should be propagated as is. Glancing at the failures though it does seem like a good chunk should just be marked as cleanable |
@amogh-jahagirdar this might cover a few cases where a RuntimeException is thrown, but I think the underlying problem is more that a lot of the implementations throw CommitFailedException or RuntimeExceptions for a wide variety of cases that may or may not be cleanable. For example, I think JdbcTableOperations throws UncheckedSQLException for a lot of cases that are cleanable. Overall, I'm in favor of deferring to not cleaning up, but I think due to the current exception handling, there are lots of edge cases. |
This change currently handles all the RuntimeExceptions where they are not cleanable. I think CommitFailedException should always be cleanable. That's the way it's marked right now. If catalog implementations are throwing that when it's not cleanable, then I think those implementations should throw a CommitStateUnknown exception. In other words catalog implementations should only indicate cleanup when the commit state is known. It probably helps to break down the exceptions in terms of intentions and whether or not the are cleanable. Then in the Java implementation we can establish a standard practice for catalog implementations. I don't think there are too many edge cases as long as there's an opinionated expectation as to what exceptions we expect Catalogs/internal APIs to throw. CommitStateUnknownException -> The original exception in the reference implementation that would get thrown in case it's unclear if the write went through or not. Most cases surface this, and various engine integrations like Spark use this exception to make sure they don't clean up when they shouldn't. CommitFailedException -> I think this should only ever be thrown due to conflicts, which by definition mean that we know the commit state is failed and we can cleanup. If implementations are throwing CommitFailedException when it's not clear it failed, I think those implementations should be fixed to throw CommitStatusUnknownException instead. RuntimeException -> I think the expectation here should be RuntimeException which are explicitly Cleanable can be eagerly cleaned. For every other RuntimeException that gets surfaced, I think it should be assumed that the commit state is unknown and there should be no cleanup that gets performed. Sure there are cases we can miss an eager cleanup, in this model but we can get them safely with orphan file removal; missing an eager cleanup is a better alternative to possibly corrupting the table (I think we're on the same page there). That's the rationale behind this PR and the original metadata cleanup PR.
Yeah I think this fits into the RuntimeException case above. Right now we're missing possible eager cleanups here; if there are cases where it's cleanable, new exceptions can be introduced which implement CleanableFailure and those are thrown in the right conditions to indicate it's safe to cleanup. I think it's a good default for any RuntimeException which hasn't opted into being cleanable (in this case UncheckedSQLException), to skip cleanup. |
👍 to the conservative approach being taken here to not assume a failure is |
I like the introduction of the Seems that this PR is trying to standardize uncleanable commit errors to I like the safer approach of explicitly defining cleanable errors (instead of defining uncleanable errors). Engines/catalogs can handle cleanable errors explicitly. |
Sorry for the delayed reply everyone, just getting back into this PR.
I think it's a bit hard to add to the spec because the spec should be generalizable to all programming langauges/implementations, where a Java exception does not really fit that mold. However, I do like the idea of standardizing the exception usage in the Java implementation and the JVM engine integrations!
It is true that this PR is attempting to standardizing uncleanable commit errors to CommitStateUnknownException, but to be clear everything except CleanableFailure is considered CommitStateUnknownException which is intentional for guaranteeing safe cleanup.In the current implementation the standardization of uncleanable to |
f9fb642
to
244f3b6
Compare
9fde7ed
to
9a41550
Compare
@stevenzwu I updated so that instead of normalizing non-cleanable failures to |
@@ -355,8 +356,11 @@ private void replaceManifests( | |||
// don't clean up added manifest files, because they may have been successfully committed. | |||
throw commitStateUnknownException; | |||
} catch (Exception e) { | |||
// delete all new manifests because the rewrite failed | |||
deleteFiles(Iterables.transform(addedManifests, ManifestFile::path)); | |||
if (e instanceof CleanableFailure) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to backport this to 3.4/3.3
|
||
doReturn(util).when(spyRewrite).commitManager(table.currentSnapshot().snapshotId()); | ||
|
||
assertThatThrownBy(() -> spyRewrite.execute()) | ||
.as("Should fail entire rewrite if commit fails") | ||
.isInstanceOf(RuntimeException.class) | ||
.hasMessage("Commit Failure"); | ||
.hasMessageContaining("Cannot commit rewrite"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this error message change? The exception message wasn't changed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this test to fail with CommitFailedException
which is cleanable since the test prior to this change had an expectation that orphans were removed. When CommitFailedException gets thrown though, it ends up getting surfaced as a RuntimeException https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L317 with this error message "Cannot commit rewrite...". I've also just added another test to explicitly verify the non-cleanable case for rewriting data files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me other than I don't see why the exception message changed in tests. That's minor so I wouldn't block this.
9a41550
to
030df19
Compare
Thanks for the review @rdblue , I'll go ahead and merge for the 1.6 milestone. @stevenzwu @sumedhsakdeo @danielcweeks just let me know if you see any concerns and I'll address them before the milestone. |
@amogh-jahagirdar thanks for making this change. sorry, I have been OOO for a while. |
Upstream callers of the transaction/snapshot producer APIs such as engines like Spark currently handle CommitStateUnknown exceptions to avoid cleaning up data files when it's unclear if the commit was successful or not. For example, see the rewrite data files procedure here: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitManager.java#L116
Currently, we only cleanup metadata files if strict cleanup is enabled (default true) and it's a cleanable failure. We should extend this to data file cleanups that upstream callers may do. To do that we can throw a commit state unknown exception in SnapshotProducer/BaseTransaction if strict cleanup is enabled and it's not a cleanable failure. Then engines won't go and cleanup data files.