Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, Spark: Spark writes/actions should only perform cleanup if failure is cleanable #10373

Conversation

amogh-jahagirdar
Copy link
Contributor

Upstream callers of the transaction/snapshot producer APIs such as engines like Spark currently handle CommitStateUnknown exceptions to avoid cleaning up data files when it's unclear if the commit was successful or not. For example, see the rewrite data files procedure here: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitManager.java#L116

Currently, we only cleanup metadata files if strict cleanup is enabled (default true) and it's a cleanable failure. We should extend this to data file cleanups that upstream callers may do. To do that we can throw a commit state unknown exception in SnapshotProducer/BaseTransaction if strict cleanup is enabled and it's not a cleanable failure. Then engines won't go and cleanup data files.

@github-actions github-actions bot added the core label May 24, 2024
@amogh-jahagirdar
Copy link
Contributor Author

A few tests are failing now due to always throwing CommitStateUnknownException; I'll step through and see which exceptions should be marked as cleanable and which should be propagated as is. Glancing at the failures though it does seem like a good chunk should just be marked as cleanable

@danielcweeks
Copy link
Contributor

@amogh-jahagirdar this might cover a few cases where a RuntimeException is thrown, but I think the underlying problem is more that a lot of the implementations throw CommitFailedException or RuntimeExceptions for a wide variety of cases that may or may not be cleanable.

For example, I think JdbcTableOperations throws UncheckedSQLException for a lot of cases that are cleanable.

Overall, I'm in favor of deferring to not cleaning up, but I think due to the current exception handling, there are lots of edge cases.

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented May 25, 2024

This might cover a few cases where a RuntimeException is thrown, but I think the underlying problem is more that a lot of the implementations throw CommitFailedException or RuntimeExceptions for a wide variety of cases that may or may not be cleanable.

This change currently handles all the RuntimeExceptions where they are not cleanable. I think CommitFailedException should always be cleanable. That's the way it's marked right now. If catalog implementations are throwing that when it's not cleanable, then I think those implementations should throw a CommitStateUnknown exception. In other words catalog implementations should only indicate cleanup when the commit state is known.

It probably helps to break down the exceptions in terms of intentions and whether or not the are cleanable. Then in the Java implementation we can establish a standard practice for catalog implementations. I don't think there are too many edge cases as long as there's an opinionated expectation as to what exceptions we expect Catalogs/internal APIs to throw.

CommitStateUnknownException -> The original exception in the reference implementation that would get thrown in case it's unclear if the write went through or not. Most cases surface this, and various engine integrations like Spark use this exception to make sure they don't clean up when they shouldn't.

CommitFailedException -> I think this should only ever be thrown due to conflicts, which by definition mean that we know the commit state is failed and we can cleanup. If implementations are throwing CommitFailedException when it's not clear it failed, I think those implementations should be fixed to throw CommitStatusUnknownException instead.

RuntimeException -> I think the expectation here should be RuntimeException which are explicitly Cleanable can be eagerly cleaned. For every other RuntimeException that gets surfaced, I think it should be assumed that the commit state is unknown and there should be no cleanup that gets performed. Sure there are cases we can miss an eager cleanup, in this model but we can get them safely with orphan file removal; missing an eager cleanup is a better alternative to possibly corrupting the table (I think we're on the same page there). That's the rationale behind this PR and the original metadata cleanup PR.

For example, I think JdbcTableOperations throws UncheckedSQLException for a lot of cases that are cleanable.

Yeah I think this fits into the RuntimeException case above. Right now we're missing possible eager cleanups here; if there are cases where it's cleanable, new exceptions can be introduced which implement CleanableFailure and those are thrown in the right conditions to indicate it's safe to cleanup. I think it's a good default for any RuntimeException which hasn't opted into being cleanable (in this case UncheckedSQLException), to skip cleanup.

@zackurey
Copy link

👍 to the conservative approach being taken here to not assume a failure is Cleanable unless explicitly categorized as such. I worry about the burden it puts on catalog implementations to categorize every error into either CommitStateUnknown or CommitStateFailed. We’ve encountered issues where edge cases manifest as generic RuntimeExceptions, which leads to inadvertent data file deletion. We run orphan file deletion frequently, so any data files that should have been reaped, but aren’t due to the change in behavior being introduced here, would have negligible COGs impact.

@stevenzwu
Copy link
Contributor

stevenzwu commented May 28, 2024

Upstream callers of the transaction/snapshot producer APIs such as engines like Spark currently handle CommitStateUnknown exceptions to avoid cleaning up data files when it's unclear if the commit was successful or not.

I like the introduction of the CleanableFailure marker interface for this purpose. Engines can potentially updated to reflect that intention. On the other hand, the challenge is that this is not defined in the spec. It may be less obvious to engine/catalog implementer. Some engines (like Spark) have learned to handle CommitStateUnknownException. Can we standardize on CleanableFailure for Java implementation? Can we add the behavior clarification to the spec?

Seems that this PR is trying to standardize uncleanable commit errors to CommiteStateUnknownException.

I like the safer approach of explicitly defining cleanable errors (instead of defining uncleanable errors). Engines/catalogs can handle cleanable errors explicitly.

@amogh-jahagirdar amogh-jahagirdar added this to the Iceberg 1.6.0 milestone Jun 10, 2024
@amogh-jahagirdar
Copy link
Contributor Author

Sorry for the delayed reply everyone, just getting back into this PR.

@stevenzwu

I like the introduction of the CleanableFailure marker interface for this purpose. Engines can potentially updated to reflect that intention. On the other hand, the challenge is that this is not defined in the spec. It may be less obvious to engine/catalog implementer. Some engines (like Spark) have learned to handle CommitStateUnknownException. Can we standardize on CleanableFailure for Java implementation? Can we add the behavior clarification to the spec?

I think it's a bit hard to add to the spec because the spec should be generalizable to all programming langauges/implementations, where a Java exception does not really fit that mold. However, I do like the idea of standardizing the exception usage in the Java implementation and the JVM engine integrations!

Seems that this PR is trying to standardize uncleanable commit errors to CommiteStateUnknownException.
I like the safer approach of explicitly defining cleanable errors (instead of defining uncleanable errors). Engines/catalogs can handle cleanable errors explicitly.

It is true that this PR is attempting to standardizing uncleanable commit errors to CommitStateUnknownException, but to be clear everything except CleanableFailure is considered CommitStateUnknownException which is intentional for guaranteeing safe cleanup.In the current implementation the standardization of uncleanable to CommitStateUnknownException was intentional since engines already handle that, and I was trying to avoid changing the different engine integrations. However, you bring up a good point that it may be better to go ahead and change the engines to explicitly handle CleanableFailure. I'll look into this!

@amogh-jahagirdar amogh-jahagirdar force-pushed the stop-cleaning-datafiles-when-unknown branch from f9fb642 to 244f3b6 Compare June 17, 2024 18:51
@github-actions github-actions bot added the spark label Jun 17, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the stop-cleaning-datafiles-when-unknown branch 8 times, most recently from 9fde7ed to 9a41550 Compare June 18, 2024 15:04
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jun 18, 2024

@stevenzwu I updated so that instead of normalizing non-cleanable failures to CommitStateUnknownExcpetion now engines, in this case Spark will explicitly handle CleanableFailure. It's a bit more implementation than the original approach since a few touchpoints in Spark integration needed to be updated but I think it makes more sense because in the original approach maybe there can be cases which still fail to actually get normalized to CommitStateUnknown exception (I think this is what @danielcweeks was getting at). Now it's explicit on the engine integration side that we only cleanup if it's a cleanable failure.

@amogh-jahagirdar amogh-jahagirdar requested review from stevenzwu, sumedhsakdeo, danielcweeks and rdblue and removed request for sumedhsakdeo June 18, 2024 20:25
@@ -355,8 +356,11 @@ private void replaceManifests(
// don't clean up added manifest files, because they may have been successfully committed.
throw commitStateUnknownException;
} catch (Exception e) {
// delete all new manifests because the rewrite failed
deleteFiles(Iterables.transform(addedManifests, ManifestFile::path));
if (e instanceof CleanableFailure) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to backport this to 3.4/3.3

@amogh-jahagirdar amogh-jahagirdar changed the title Core: Throw CommitStateUnknownException if RuntimeException that is not marked as cleanable is thrown Core, Spark: Spark writes/actions should only perform cleanup if failure is cleanable Jun 18, 2024

doReturn(util).when(spyRewrite).commitManager(table.currentSnapshot().snapshotId());

assertThatThrownBy(() -> spyRewrite.execute())
.as("Should fail entire rewrite if commit fails")
.isInstanceOf(RuntimeException.class)
.hasMessage("Commit Failure");
.hasMessageContaining("Cannot commit rewrite");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this error message change? The exception message wasn't changed above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this test to fail with CommitFailedException which is cleanable since the test prior to this change had an expectation that orphans were removed. When CommitFailedException gets thrown though, it ends up getting surfaced as a RuntimeException https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L317 with this error message "Cannot commit rewrite...". I've also just added another test to explicitly verify the non-cleanable case for rewriting data files.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me other than I don't see why the exception message changed in tests. That's minor so I wouldn't block this.

@amogh-jahagirdar amogh-jahagirdar force-pushed the stop-cleaning-datafiles-when-unknown branch from 9a41550 to 030df19 Compare June 20, 2024 03:09
@amogh-jahagirdar
Copy link
Contributor Author

Thanks for the review @rdblue , I'll go ahead and merge for the 1.6 milestone. @stevenzwu @sumedhsakdeo @danielcweeks just let me know if you see any concerns and I'll address them before the milestone.

@amogh-jahagirdar amogh-jahagirdar merged commit c67c912 into apache:main Jun 20, 2024
45 checks passed
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jun 20, 2024
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jun 20, 2024
@stevenzwu
Copy link
Contributor

instead of normalizing non-cleanable failures to CommitStateUnknownExcpetion now engines, in this case Spark will explicitly handle CleanableFailure

@amogh-jahagirdar thanks for making this change. sorry, I have been OOO for a while.

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants