
Disable TestFlinkTableSink#testHashDistributeMode until it's fixed by #2575 #3307

Closed

Conversation

@nastra (Contributor) commented Oct 18, 2021

No description provided.

@kbendick (Contributor)

Linking the issue here: #2575

@kbendick (Contributor)

Did I link the wrong thing? Issue #2575 doesn't seem to be for fixing it. Is there an issue for that already, or do we need to make one?

I am not getting the error in my local env but I'll keep checking as I'm sure you're getting it.

@nastra (Contributor, Author) commented Oct 19, 2021

@kbendick #2575 is for fixing the test flakiness, imo. I'm not seeing the error locally either, but out of the last 10 CI runs across different PRs, I've seen it fail at least 3 times.

@openinx (Member) commented Oct 19, 2021

Linking the PR here: #3258

@openinx (Member) commented Oct 19, 2021

Thanks @nastra for bringing this up. I found a case that would break this flaky unit test:

In the unit test, we write the following records into an Apache Iceberg table, shuffling by the partition field data (the write parallelism is 2; a rough sketch of this setup follows the rows):

(1, 'aaa'), (1, 'bbb'), (1, 'ccc')
(2, 'aaa'), (2, 'bbb'), (2, 'ccc')
(3, 'aaa'), (3, 'bbb'), (3, 'ccc')
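
For readers following along, here is a minimal sketch of the setup this test exercises. This is not the actual test code; the catalog, database, and table names are illustrative. The relevant piece is the `write.distribution-mode=hash` table property, which makes the Flink sink shuffle rows by the partition column before writing.

```java
// Rough sketch (not the actual test): an Iceberg table partitioned by `data`,
// written with hash distribution so rows for the same partition go to the same
// writer subtask. Catalog/database/table names are illustrative only.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HashDistributeSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());
    // Assumes an Iceberg catalog named `iceberg_catalog` has already been registered.
    tEnv.executeSql(
        "CREATE TABLE iceberg_catalog.db.t (id INT, data STRING) PARTITIONED BY (data) "
            + "WITH ('write.distribution-mode'='hash')");
    tEnv.executeSql(
        "INSERT INTO iceberg_catalog.db.t VALUES "
            + "(1, 'aaa'), (1, 'bbb'), (1, 'ccc'), "
            + "(2, 'aaa'), (2, 'bbb'), (2, 'ccc'), "
            + "(3, 'aaa'), (3, 'bbb'), (3, 'ccc')");
  }
}
```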

Since the streaming job may produce multiple checkpoints while running, it's possible that the records are written across checkpoints like this:

  • checkpoint#1

    • (1, 'aaa')
    • (1, 'bbb')
    • (1, 'ccc')
  • checkpoint#2

    • (2, 'aaa')
    • (2, 'bbb')
    • (2, 'ccc')
    • (3, 'aaa')
    • (3, 'bbb')
    • (3, 'ccc')

Each checkpoint then produces a separate data file for each partition. Let's say:

  • checkpoint#1

    • produces data-file-1 for partition aaa
    • produces data-file-2 for partition bbb
    • produces data-file-3 for partition ccc
  • checkpoint#2

    • produces data-file-4 for partition aaa
    • produces data-file-5 for partition bbb
    • produces data-file-6 for partition ccc

In the IcebergFilesCommitter, we use the default MergeAppend to merge the manifests, which produces:

  • checkpoint#1

    • manifest-1: includes data-file-1, data-file-2, data-file-3
  • checkpoint#2

    • manifest-2: includes data-file-1, data-file-2, data-file-3, data-file-4, data-file-5, data-file-6

Finally, in this line, we encounter data-file-1 twice in the result map, and the assertion fails.

I think, for the purpose of fixing this unit test, we only need to find the data files newly added by each commit (see the sketch below).
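
A minimal sketch of that idea, assuming the public Snapshot API is used to look only at the files added by each commit. This is not the actual fix; whether Snapshot#addedFiles() is available depends on the Iceberg version in use.

```java
// Sketch: count data files per partition for each commit by looking only at the
// files that each snapshot added, instead of scanning every entry in the merged
// manifests. Assumes Snapshot#addedFiles() exists in this Iceberg version.
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class AddedFilesPerCommit {
  static Map<Long, Map<String, Integer>> addedFilesPerPartition(Table table) {
    Map<Long, Map<String, Integer>> filesPerCommit = new HashMap<>();
    for (Snapshot snapshot : table.snapshots()) {
      Map<String, Integer> perPartition = new HashMap<>();
      for (DataFile file : snapshot.addedFiles()) {
        // DataFile#partition() returns the partition tuple of the file.
        perPartition.merge(file.partition().toString(), 1, Integer::sum);
      }
      filesPerCommit.put(snapshot.snapshotId(), perPartition);
    }
    return filesPerCommit;
  }
}
```

With something like this, the test could assert that each commit added exactly one data file per partition it wrote, regardless of how the manifests were merged.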

@stevenzwu (Contributor) commented Oct 19, 2021

checkpoint#2
manifest-2: includes data-file-1, data-file-2, data-file-3, data-file-4, data-file-5, data-file-6

@openinx why would the above happen? Did the commit for checkpoint #1 fail, so that the first set of data-file-[1-3] got committed along with the second set of data-file-[4-6]? If the first commit is successful, the 2nd commit shouldn't include data-file-[1-3], which are already committed.

@szehon-ho (Collaborator) commented Oct 22, 2021

@stevenzwu I think @openinx's theory makes sense: when you merge manifest files on commit, I believe the entries for [data-file-1, data-file-2, data-file-3] have status EXISTING, whereas new data files like [data-file-4, data-file-5, data-file-6] are ADDED.

My original thought was along this line when I made #2989, but I missed that it was merging the manifests.

I was looking to see if we can adapt that code to take this into account, but I realized ManifestEntry is not a public class, unlike DataFile. There's no way to get whether a data file is EXISTING or ADDED, though the entries metadata table is exposed via Spark. Do you think we can expose it, or maybe I'm missing another way to get this information? @rdblue any thoughts? Thanks.
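
For reference, a hedged illustration of the Spark route mentioned above: the entries metadata table exposes the per-entry status that ManifestEntry hides (0 = EXISTING, 1 = ADDED, 2 = DELETED). The catalog and table names here are hypothetical.

```java
// Illustration only: query the entries metadata table through Spark to see which
// data files were ADDED in each snapshot (status 1) versus carried over as
// EXISTING (status 0). Catalog/table names are hypothetical.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EntriesStatusExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("entries-status")
        .getOrCreate();
    Dataset<Row> added = spark.sql(
        "SELECT snapshot_id, status, data_file.file_path "
            + "FROM iceberg_catalog.db.t.entries "
            + "WHERE status = 1");
    added.show(false);
  }
}
```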

@kbendick (Contributor)

@nastra Have you (or anybody else in this discussion) seen the flakiness of this test recently in CI (especially since the change to running tests separately)?

If so, I think it’s a good idea to disable the test ASAP and copy this conversation to an issue or elsewhere.

@rdblue (Contributor) commented Oct 24, 2021

I strongly prefer fixing tests rather than disabling them. It sounds like we're close to a resolution here, although I can't say I understand the comments explaining the current theory. If that's the case, then let's find a way to fix the test.

@openinx (Member) commented Oct 25, 2021

Let me publish a PR to address this issue today; sorry for the delay due to my limited bandwidth. (I should have provided a PR when I left the #3307 (comment), so that others wouldn't need to prepare one again.)

@nastra (Contributor, Author) commented Oct 25, 2021

Definitely +1 for fixing the test. Closing this one then, as it seems we're close to having a fix with #2575.

@nastra nastra closed this Oct 25, 2021
@nastra nastra deleted the disable-flink-testHashDistributeMode-test branch October 25, 2021 06:39