Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException #3341

Merged
merged 7 commits into from
Nov 4, 2021

Conversation

sumangala-patki
Copy link
Contributor

@sumangala-patki sumangala-patki commented Aug 26, 2021

PR addresses transient failures in the following test classes:

  • ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running parallelly. To be marked for sequential-only run to avoid transient failure
  • ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

@sumangala-patki
Copy link
Contributor Author

sumangala-patki commented Aug 26, 2021

TEST RESULTS

HNS Account Location: East US 2
NonHNS Account Location: East US 2, Central US

AppendBlob HNS-OAuth

[WARNING] Tests run: 106, Failures: 0, Errors: 0, Skipped: 1
[ERROR] Failures: 
[ERROR]   ITestAzureBlobFileSystemFileStatus.testLastModifiedTime:144->Assert.assertTrue:42->Assert.fail:89 lastModifiedTime should be before createEndTime
[ERROR] Errors: 
[ERROR]   ITestAzureBlobFileSystemLease.testTwoWritersCreateAppendNoInfiniteLease:178->twoWriters:166 » IO
[ERROR] Tests run: 557, Failures: 1, Errors: 1, Skipped: 98

HNS-OAuth

[WARNING] Tests run: 106, Failures: 0, Errors: 0, Skipped: 1
[ERROR] Failures: 
[ERROR]   ITestAzureBlobFileSystemFileStatus.testLastModifiedTime:144->Assert.assertTrue:42->Assert.fail:89 lastModifiedTime should be before createEndTime 
[ERROR] Tests run: 557, Failures: 1, Errors: 0, Skipped: 98

HNS-SharedKey

[WARNING] Tests run: 106, Failures: 0, Errors: 0, Skipped: 2
[ERROR] Failures: 
[ERROR]   ITestAzureBlobFileSystemFileStatus.testLastModifiedTime:144->Assert.assertTrue:42->Assert.fail:89 lastModifiedTime should be before createEndTime
[ERROR] Tests run: 557, Failures: 1, Errors: 0, Skipped: 54

NonHNS-SharedKey

[WARNING] Tests run: 106, Failures: 0, Errors: 0, Skipped: 2
[ERROR] Failures: 
[ERROR]   ITestAzureBlobFileSystemFileStatus.testLastModifiedTime:144->Assert.assertTrue:42->Assert.fail:89 lastModifiedTime should be before createEndTime
[ERROR] Tests run: 557, Failures: 1, Errors: 0, Skipped: 276

JIRAs for failures: ITestAzureBlobFileSystemLease, testLastModifiedTime

@apache apache deleted a comment from hadoop-yetus Aug 30, 2021
@steveloughran
Copy link
Contributor

happy with all the code changes. Now, with the resetting, can the changes to the pom be removed? It isn't needed to fix this problem.

@sumangala-patki
Copy link
Contributor Author

Regarding the streamStats test pom fix:
statistics is a static member of filesystem class and so common for tests. Checked by printing hashcodes of the instance returned by createfilesystem().getFsStatistics(): they are the same across different tests/test classes run in parallel
we do reset currently during test run but it hasn't helped, so marking in pom for sequential seems to be the only resort

@sumangala-patki sumangala-patki marked this pull request as ready for review August 31, 2021 10:08
@apache apache deleted a comment from hadoop-yetus Aug 31, 2021
@steveloughran
Copy link
Contributor

they are the same across different tests/test classes run in parallel
we do reset currently during test run but it hasn't helped, so marking in pom for sequential seems to be the only resort

I think you are a bit confused about what parallel/sequential means. I stated this a few days ago; nothing has changed in Maven since then.

  1. In the parallel test run phase, Maven surefire/failsafe spawns multiple processes to execute test suite in parallel.
  2. Within a process, tests are run sequentially.

The issue is not parallel execution, the issue is state persisting from previous runs.

The reason things work in sequential & not parallel is the <reuseForks>false</reuseForks> flag. That's all. it has nothing to do with parallel test runs -not directly. Indirectly it could be due to something like: the fact tests take longer to run, or that the destination store is updated by other processes.

Before moving into sequential phase, and so making test runs slower, can we work out why the statistics.reset(); calls aren't working. As I keep saying: it is not failing from parallel tests running.

Meaning it could be one of

  • stats references not binding properly
  • counters double/under-counting in some circumstances.
  • state of remote store affecting results.
  • something remaining active in the process
  • reset() not working.

I think we need to understand more before committing this -otherwise it's not so much a fix for the problem as a "we changed this and it went away" kind of PR.

It probably is just some state left over from the previous test run -but there is a risk of it being something real.

@sumangala-patki
Copy link
Contributor Author

I agree with the fact that any given process executes tests in a sequential manner.

Some findings:
The failure can be consistently reproduced by adding a dummy test (in the same class) to call the existing test, and running the full test suite. It could be verified that each of the two tests has the correct number of invocations of the read/write op increment function for the respective paths, and no extra updates. Moreover, the statistics reset method works; all read/write operation counts were verified to be 0 right after reset was called, which is before each section (small file, large file). This would rule out left-over states and reset issues. As stats updates happen within the driver before store calls to read, response from the remote store will not affect the values.
Therefore, this instance of test run should have passed considering no interference between statistics reset and the value assertion, and with only the correct number of operation increments.

For one failing scenario:
Expected value for large file read op count: 102 or 103
Actual value in streamOps test: 99
Actual value in dummy test: 198
Value according to logs for each test: 103

Therefore, one way this could have happened is that the two tests (possibly along with any other test class involving read) were running in different processes, but around the same time. This resulted in these tests modifying the same statistics variable, which could also explain the drop in read count despite the test having executed the expected number of read ops - the statistics reset was called in one test while the other test was in the middle of executing read.
Hence, we can conclude that any test running in parallel processes along with the stats test may affect this test if it performs read/write. To avoid this scenario, we can introduce an additional filesystem level statistics variable that is not static, apart from the current static one that records operations globally from all filesystems created in a session.

@steveloughran
Copy link
Contributor

where is this statistics variable? It is some counter in the remote FS, or a field in the process running JUnit?

@steveloughran
Copy link
Contributor

OK, I have now discovered something while trying to track down why test running of the abfs integration tests in MAPREDUCE-7341 fail.

the hadoop-azure pom does request parallel method execution

<parallel>both</parallel>

I think this means that different processes do run different methods in the parallel phase.

This actually breaks anything which expects static state to be shared across different threads in the same test case, and also means that tests where @FixMethodOrder(MethodSorters.NAME_ASCENDING) is used to fix the method order

Turning this feature off gets those tests to work.

try commenting it out and see what difference it makes for you.

I would like to change it to the same that other parallel runs have: all methods in a TestCase is run in the same process, and after that test case has finished, the process ends.

I'll make it a separate change from my PR, but will include it my PR as well, for completeness.

(thought it was some junit5 thing; not so)

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, now I know that abfs test run really does parallelise methods in same suite and process (which is fast but brittle), you do need to isolate this.

all good apart from the . to keep javadocs quiet

public static void ResetStatusToFirstTokenFetch() {
/**
* Clear earlier retry details and reset RetryTestTokenProvider instance to
* state of first access token fetch call
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

When parallel is set to "classes" (instead of "both"), the StreamStats test passes even with the failure scenario induced by dummy test, as the two tests are run sequentially in the process. However, I guess there might be occasional failures if a different class reads/writes simultaneously in a different process; could not reproduce failure though.

Have made a minor correction to the pom file; the test will be excluded from the parallel run (currently classesandMethods/both) and executed separately along with a bunch of other integration tests that need to be run sequentially

@apache apache deleted a comment from hadoop-yetus Sep 22, 2021
@steveloughran
Copy link
Contributor

This is ready to go in, with one little change needed

javadocs in RetryTestTokenProvider.java to add "." at the end of the first sentence, even if there's only one sentence in the docs.

some versions of javadoc completely break here, so it's not just a little style thing, it's a "stop the build failing" change

thanks

@sumangala-patki
Copy link
Contributor Author

Hi @steveloughran, I had added the "." post your earlier review and can see it in the PR. Could you please check again if it's been updated?

@apache apache deleted a comment from hadoop-yetus Nov 4, 2021
@apache apache deleted a comment from hadoop-yetus Nov 4, 2021
@apache apache deleted a comment from hadoop-yetus Nov 4, 2021
@apache apache deleted a comment from hadoop-yetus Nov 4, 2021
@apache apache deleted a comment from hadoop-yetus Nov 4, 2021
@steveloughran steveloughran merged commit 82658a2 into apache:trunk Nov 4, 2021
asfgit pushed a commit that referenced this pull request Nov 4, 2021
…cs and ITestAbfsRestOperationException (#3341)

Addresses transient failures in the following test classes:

* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure

* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

closes #3341

Contributed by Sumangala Patki

Change-Id: Ied4dec35c81e94efe5f999acae4bb8fde278202e
asfgit pushed a commit that referenced this pull request Nov 5, 2021
…Statistics and ITestAbfsRestOperationException (#3341)"

This reverts commit 0379aeb.
asfgit pushed a commit that referenced this pull request Nov 5, 2021
…Statistics and ITestAbfsRestOperationException (#3341)"

This reverts commit 82658a2.
@steveloughran
Copy link
Contributor

afraid i had to revert this change as the build now insists that the @VisibleForTesting annotation refers to the hadoop one.
can you create a new PR with the updated import?

steveloughran pushed a commit that referenced this pull request Sep 6, 2022
…cs and ITestAbfsRestOperationException (#3699)


Successor for the reverted PR #3341, using the hadoop @VisibleForTesting attribute

Contributed by Sumangala Patki
asfgit pushed a commit that referenced this pull request Sep 6, 2022
…cs and ITestAbfsRestOperationException (#3699)

Successor for the reverted PR #3341, using the hadoop @VisibleForTesting attribute

Contributed by Sumangala Patki
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…cs and ITestAbfsRestOperationException (apache#3341)


Addresses transient failures in the following test classes:

* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure

* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

closes apache#3341

Contributed by Sumangala Patki
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…Statistics and ITestAbfsRestOperationException (apache#3341)"

This reverts commit 82658a2.
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…cs and ITestAbfsRestOperationException (apache#3699)


Successor for the reverted PR apache#3341, using the hadoop @VisibleForTesting attribute

Contributed by Sumangala Patki
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants