
Disable TestFlinkTableSink#testHashDistributeMode until it's fixed by #2575 #3307

Closed

Conversation

@nastra (Contributor) commented Oct 18, 2021

No description provided.

@kbendick (Contributor)

Linking the issue here: #2575

@kbendick (Contributor)

Did I link the wrong thing? Issue #2575 doesn't seem to be for fixing it. Is there an issue for that already, or do we need to make one?

I am not getting the error in my local env but I'll keep checking as I'm sure you're getting it.

@nastra (Contributor, Author) commented Oct 19, 2021

@kbendick #2575 is for fixing the test flakiness, imo. I'm not seeing the error locally either, but out of the last 10 CI runs across different PRs, I've seen it fail at least 3 times.

@openinx (Member) commented Oct 19, 2021

Linking the PR here: #3258

@openinx (Member) commented Oct 19, 2021

Thanks @nastra for bringing this up. I found a case that would break this flaky unit test:

In the unit test, we write the following records into an Apache Iceberg table, shuffling by the partition field data (the write parallelism is 2; a rough sketch of this setup follows the rows):

(1, 'aaa'), (1, 'bbb'), (1, 'ccc')
(2, 'aaa'), (2, 'bbb'), (2, 'ccc')
(3, 'aaa'), (3, 'bbb'), (3, 'ccc')
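
For readers following along, here is a minimal sketch of the setup this test exercises. This is not the actual test code; the catalog, database, and table names are illustrative. The relevant piece is the `write.distribution-mode=hash` table property, which makes the Flink sink shuffle rows by the partition column before writing.

```java
// Rough sketch (not the actual test): an Iceberg table partitioned by `data`,
// written with hash distribution so rows for the same partition go to the same
// writer subtask. Catalog/database/table names are illustrative only.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HashDistributeSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());
    // Assumes an Iceberg catalog named `iceberg_catalog` has already been registered.
    tEnv.executeSql(
        "CREATE TABLE iceberg_catalog.db.t (id INT, data STRING) PARTITIONED BY (data) "
            + "WITH ('write.distribution-mode'='hash')");
    tEnv.executeSql(
        "INSERT INTO iceberg_catalog.db.t VALUES "
            + "(1, 'aaa'), (1, 'bbb'), (1, 'ccc'), "
            + "(2, 'aaa'), (2, 'bbb'), (2, 'ccc'), "
            + "(3, 'aaa'), (3, 'bbb'), (3, 'ccc')");
  }
}
```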

Since the streaming job may produce multiple checkpoints while running, it's possible that the records are written across checkpoints like this:

  • checkpoint#1

    • (1, 'aaa')
    • (1, 'bbb')
    • (1, 'ccc')
  • checkpoint#2

    • (2, 'aaa')
    • (2, 'bbb')
    • (2, 'ccc')
    • (3, 'aaa')
    • (3, 'bbb')
    • (3, 'ccc')

Each checkpoint then produces a separate data file for each partition. Let's say:

  • checkpoint#1

    • produces data-file-1 for partition aaa
    • produces data-file-2 for partition bbb
    • produces data-file-3 for partition ccc
  • checkpoint#2

    • produces data-file-4 for partition aaa
    • produces data-file-5 for partition bbb
    • produces data-file-6 for partition ccc

In the IcebergFilesCommitter, we use the default MergeAppend to merge the manifests, which produces:

  • checkpoint#1

    • manifest-1: includes data-file-1, data-file-2, data-file-3
  • checkpoint#2

    • manifest-2: includes data-file-1, data-file-2, data-file-3, data-file-4, data-file-5, data-file-6

Finally, in this line, we encounter data-file-1 twice in the result map, and the assertion fails.

I think, for the purpose of fixing this unit test, we only need to find the data files newly added by each commit (see the sketch below).
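
A minimal sketch of that idea, assuming the public Snapshot API is used to look only at the files added by each commit. This is not the actual fix; whether Snapshot#addedFiles() is available depends on the Iceberg version in use.

```java
// Sketch: count data files per partition for each commit by looking only at the
// files that each snapshot added, instead of scanning every entry in the merged
// manifests. Assumes Snapshot#addedFiles() exists in this Iceberg version.
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class AddedFilesPerCommit {
  static Map<Long, Map<String, Integer>> addedFilesPerPartition(Table table) {
    Map<Long, Map<String, Integer>> filesPerCommit = new HashMap<>();
    for (Snapshot snapshot : table.snapshots()) {
      Map<String, Integer> perPartition = new HashMap<>();
      for (DataFile file : snapshot.addedFiles()) {
        // DataFile#partition() returns the partition tuple of the file.
        perPartition.merge(file.partition().toString(), 1, Integer::sum);
      }
      filesPerCommit.put(snapshot.snapshotId(), perPartition);
    }
    return filesPerCommit;
  }
}
```

With something like this, the test could assert that each commit added exactly one data file per partition it wrote, regardless of how the manifests were merged.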

@stevenzwu (Contributor) commented Oct 19, 2021

checkpoint#2
manifest-2: includes data-file-1, data-file-2, data-file-3, data-file-4, data-file-5, data-file-6

@openinx why would the above happen? Did the commit for checkpoint #1 fail, so that the first set of data-file-[1-3] got committed along with the second set of data-file-[4-6]? If the first commit is successful, the 2nd commit shouldn't include data-file-[1-3], which are already committed.

@szehon-ho (Collaborator) commented Oct 22, 2021

@stevenzwu I think @openinx's theory makes sense: when you merge manifest files on commit, I believe the entries for [data-file-1, data-file-2, data-file-3] have status EXISTING, whereas new data files like [data-file-4, data-file-5, data-file-6] are ADDED.

My original thought was along this line when I made #2989, but I missed that it was merging the manifests.

I was looking to see if we can adapt that code to take this into account, but I realized ManifestEntry is not a public class, unlike DataFile. There's no way to get whether a data file is EXISTING or ADDED, though the entries metadata table is exposed via Spark. Do you think we can expose it, or maybe I'm missing another way to get this information? @rdblue any thoughts? Thanks.
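
For reference, a hedged illustration of the Spark route mentioned above: the entries metadata table exposes the per-entry status that ManifestEntry hides (0 = EXISTING, 1 = ADDED, 2 = DELETED). The catalog and table names here are hypothetical.

```java
// Illustration only: query the entries metadata table through Spark to see which
// data files were ADDED in each snapshot (status 1) versus carried over as
// EXISTING (status 0). Catalog/table names are hypothetical.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EntriesStatusExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("entries-status")
        .getOrCreate();
    Dataset<Row> added = spark.sql(
        "SELECT snapshot_id, status, data_file.file_path "
            + "FROM iceberg_catalog.db.t.entries "
            + "WHERE status = 1");
    added.show(false);
  }
}
```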

@kbendick (Contributor)

@nastra Have you (or anybody else in this discussion) seen the flakiness of this test recently in CI (especially since the change to running tests separately)?

If so, I think it’s a good idea to disable the test ASAP and copy this conversation to an issue or elsewhere.

@rdblue (Contributor) commented Oct 24, 2021

I strongly prefer fixing tests rather than disabling them. It sounds like we're close to a resolution here, although I can't say I understand the comments explaining the current theory. If that's the case, then let's find a way to fix the test.

@openinx (Member) commented Oct 25, 2021

Let me publish a PR to address this issue today; sorry for the delay due to my limited bandwidth. (I should have provided a PR when I left the #3307 (comment), so that others wouldn't need to prepare one again.)

@nastra (Contributor, Author) commented Oct 25, 2021

Definitely +1 for fixing the test. Closing this one then, as it seems we're close to having a fix with #2575.

@nastra nastra closed this Oct 25, 2021
@nastra nastra deleted the disable-flink-testHashDistributeMode-test branch October 25, 2021 06:39