-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark: Surface better error message during streaming planning when checkpoint snapshot not found #6480
Spark: Surface better error message during streaming planning when checkpoint snapshot not found #6480
Conversation
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
69f12bc
to
7394d5e
Compare
Seems like surfacing a better error message is indeed the right fix for this situation based on a new issue which came up: #7340 I've updated with a test cc @aokolnychyi @RussellSpitzer @szehon-ho @singhpk234 @jackye1995 |
|
||
if (snapshot == null) { | ||
throw new IllegalStateException( | ||
String.format("Failed to find expected snapshot %d", currentOffset.snapshotId())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can probably improve this message. "Cannot load current offset at snapshot x, the snapshot was expired or removed"?
I'm trying to figure out how we can give the user a bit more information about what to do in this situation. Our currentOffset should always have a valid snapshot ID so I think it's probably ok to just say that that the snapshot was actually expired if we can't find it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it'll be helpful to provide the context that expiration or some external mechanism somehow removed the snapshot, will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, although I have a suggestion around the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, +1 for the comment from Russell.
7394d5e
to
2b499cd
Compare
…eckpoint snapshot not found
2b499cd
to
110b9d1
Compare
Looks like the message is updated, thanks Amogh! |
…n checkpoint snapshot not found (apache#6480) (cherry picked from commit 456c926)
…eckpoint snapshot not found (apache#6480)
…n checkpoint snapshot not found (apache#7381) This change backports PR apache#6480 to 3.3. Co-authored-by: Amogh Jahagirdar <jahamogh@amazon.com>
Fixes #6388.
Based on the stack trace the following sequence of events seems plausible.
1.) The snapshot ID for current offset no longer exists (my hunch is due to expiration). So table.snapshot(currentOffset.snapshotId()) returns null.
2.) Then planning throws an unclear NPE here when trying to get the operation associated with the snapshot.
This is possible when resuming reading from a checkpoint where the snapshot associated with the checkpoint was expired.