-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable TestFlinkTableSink#testHashDistributeMode until it's fixed by #2575 #3307
Disable TestFlinkTableSink#testHashDistributeMode until it's fixed by #2575 #3307
Conversation
Linking the issue here: #2575 |
Did I link the wrong thing? Issue 2525 doesn't seem to be for fixing it. Is there an issue for that already or do we need to make one? I am not getting the error in my local env but I'll keep checking as I'm sure you're getting it. |
Linking the PR here: #3258 |
Thanks @nastra for bringing this up. I got a case that would break the flaky unit test from here: In the unit test case, we are trying to write the following records into apache iceberg table by shuffling by partition field
As we may produces multiple checkpoints when the streaming job is running, Then it's possible that we write the records in the following checkpoints:
Then it will produces a seperate data file for each partition in the given checkpoint. Let's say:
In the IcebergFilesCommitter , we will use the default MergeAppend to merge the manifest. Then we will produces:
Then finally, in this line, we will encounter the I think we only need to find out the newly added data files for each given commit for fixing this unit test purpose. |
@openinx why would the above happen? commit for checkpoint #1 failed and first set of data-file-[1-3] got committed along with the second set of data-file-[4-5]? if first commit is successful, 2nd commit shouldn't include data-file-[1-3] that are committed. |
@stevenzwu i think @openinx 's theory makes sense, when you merge manifest files on commit, I believe the entries for [data-file-1, data-file-2, data-file-3] has status: EXISTING, whereas new data files like [data-file-4], [data-file-5], [data-file-6] are ADDED. My original thought was along this line when I made #2989, but missed that it was merging the manifest. I was looking to see if we can adapt that code to take this into account, but realize ManifestEntry is not public class , like DataFile. There's no way to get whether a data-file is EXISTING or ADDED. Though entries table is exposed via Spark. Do you think we can expose it or maybe I'm missing another way to get this information. @rdblue any thoughts ? Thanks |
@nastra Have you (or anybody else in this discussion) seen the flakiness of this test recently in CI (especially since the change to running tests separately)? If so, I think it’s a good idea to disable the test ASAP and copy this conversation to an issue or elsewhere. |
I strongly prefer fixing tests rather than disabling them. It sounds like we're close to a resolution here, although I can't say I understand the comments explaining the current theory. If that's the case, then let's find a way to fix the test. |
Let me publish a PR today to address this issue today, sorry for the delay because of the my limited bandwidth (I should provide PR when I left the #3307 (comment) so that others don't need to prepare PR again). |
definitely +1 for fixing the test. Closing this one then as it seems we're closing to having a fix with #2575 |
No description provided.