
Resolve test flakiness #864

Closed
MadelineMurray opened this issue May 7, 2020 · 17 comments
Labels: flake 60% of the time it works 100% of the time.

Comments

@MadelineMurray
Contributor

There's been a significant increase in the number and frequency of acceptance test (AT) failures.

This looks to have started around April 22, and a couple of PRs have been made to try to fix it:

@lucassaldanha
Member

lucassaldanha commented May 7, 2020

Lucas's crazy tests

So far I've tested a few things:

Running tests on a dedicated server

The idea was to isolate any CircleCI-related issue. Eventually, we still had failures due to the 30s timeout.

Running the tests one at a time

The idea was to isolate any resource bottleneck. But eventually, we still had tests failing due to the 30s timeout.

Running tests with an increased timeout (240 seconds)

The idea was to check if the problem was related to the slow start of the nodes. Still, we had tests failing due to timeout.

Running tests on threads instead of independent java processes

The idea was to check if the problem was related to how we run the nodes as Java processes, so I ran them in-process using the flag acctests.runBesuAsProcess=false. Eventually, we had failures due to timeout (30s).
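For reference, here is a minimal sketch of what that switch amounts to, assuming the acceptance-test framework picks a node runner based on the acctests.runBesuAsProcess system property. Apart from ProcessBesuNodeRunner, which shows up in the test logs, the names below are illustrative rather than Besu's actual classes.

```java
// Illustrative sketch: choose between forking each node as a separate Java process
// or running it on a thread inside the test JVM, driven by a system property.
interface BesuNodeRunner {
  void startNode(String nodeName);
}

class ProcessBesuNodeRunner implements BesuNodeRunner {
  @Override
  public void startNode(final String nodeName) {
    // Fork build/install/besu/bin/besu with the node's CLI arguments.
  }
}

class ThreadBesuNodeRunner implements BesuNodeRunner {
  @Override
  public void startNode(final String nodeName) {
    // Build and start the node in-process on a dedicated thread.
  }
}

final class BesuNodeRunners {
  static BesuNodeRunner instance() {
    // -Dacctests.runBesuAsProcess=false selects the in-process (thread) runner;
    // leaving it unset (or true) keeps the default separate-process behaviour.
    final boolean asProcess =
        Boolean.parseBoolean(System.getProperty("acctests.runBesuAsProcess", "true"));
    return asProcess ? new ProcessBesuNodeRunner() : new ThreadBesuNodeRunner();
  }
}
```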

Conclusion (pending)

So whatever we have here doesn't seem to be related to hardware power or a timeout period that is too short. I suspect it might have something to do with the way we are starting the nodes or using resources. But so far I haven't managed to isolate a single variable that causes the tests to fail.

@AbdelStark
Contributor

AbdelStark commented May 7, 2020

I am checking whether the problem is related to EIP-1559 in #865.

Results on pull request #865

  • Attempt 1: CI OK
  • Attempt 2: CI KO, failed at the unitTests step (CircleCI link); the acceptanceTests were OK
  • Attempt 3: CI OK
  • Attempt 4: CI OK
  • Attempt 5: CI OK
  • Attempt 6: CI OK

Results on the current master branch

  • Attempt 1: CI OK
  • Attempt 2: CI OK
  • Attempt 3: CI OK
  • Attempt 4: CI OK
  • Attempt 5: CI OK
  • Attempt 6: CI OK

@MadelineMurray
Contributor Author

@abdelhamidbakhta - there was a RocksDB unit test failure earlier today as well - https://circleci.com/gh/hyperledger/besu/15791

@benjamincburns

benjamincburns commented May 7, 2020

@lucassaldanha @abdelhamidbakhta (cc @MadelineMurray) - you might try git bisect locally with multiple test runs on each checkout to see if you can isolate a commit that made things break

@benjamincburns

If you aren't familiar with git bisect, I made a sort of tutorial video a while back: https://www.youtube.com/watch?v=QmBk_eYGGjE

@hmijail
Contributor

hmijail commented May 8, 2020

@lucassaldanha, just saw this so it might be too late, but did you check for OOM messages in the kernel logs when you tested on your own server?

@shemnon
Contributor

shemnon commented May 8, 2020

Do we need to explicitly turn off NAT? I see the tests are auto-detecting Docker NAT, but unless we are testing NAT, that seems like an unnecessary variable.

node1 | 22:54:14.452 | Test worker | INFO | ProcessBesuNodeRunner | Creating besu process with params [build/install/besu/bin/besu, --data-path, /tmp/acctest4993954506636015820, --network, DEV, --discovery-enabled, true, --p2p-host, 127.0.0.1, --p2p-port, 0, --bootnodes, --rpc-http-enabled, --rpc-http-host, 127.0.0.1, --rpc-http-port, 0, --rpc-http-api, ETH,NET,WEB3, --rpc-http-authentication-enabled, --rpc-http-authentication-credentials-file, /home/circleci/project/acceptance-tests/tests/build/resources/test/authentication/auth.toml, --rpc-ws-enabled, --rpc-ws-host, 127.0.0.1, --rpc-ws-port, 0, --rpc-ws-api, ETH,NET,WEB3, --rpc-ws-authentication-enabled, --rpc-ws-authentication-credentials-file, /home/circleci/project/acceptance-tests/tests/build/resources/test/authentication/auth.toml, --Xp2p-check-maintained-connections-frequency, 60, --Xp2p-initiate-connections-frequency, 5, --Xsecp256k1-native-enabled=false, --Xaltbn128-native-enabled=false, --key-value-storage, rocksdb, --auto-log-bloom-caching-enabled, false]
[snip]
node1 | 2020-05-07 22:54:16.321+00:00 | main | INFO | ProtocolScheduleBuilder | Protocol schedule created with milestones: [ConstantinopleFix: 0]
node1 | 2020-05-07 22:54:16.415+00:00 | main | INFO | RunnerBuilder | Detecting NAT service.
node1 | 2020-05-07 22:54:16.930+00:00 | main | INFO | Runner | Starting Ethereum main loop ...
node1 | 2020-05-07 22:54:16.930+00:00 | main | INFO | DockerNatManager | Starting docker NAT manager.
node1 | 2020-05-07 22:54:16.931+00:00 | main | INFO | NetworkRunner | Starting Network.

build artifact

@lucassaldanha
Member

> @lucassaldanha, just saw this so it might be too late, but did you check for OOM messages in the kernel logs when you tested on your own server?

No OOM errors.

@lucassaldanha
Member

> Do we need to explicitly turn off NAT? I see the tests are auto-detecting Docker NAT, but unless we are testing NAT, that seems like an unnecessary variable.

It might be worth trying it.

@shemnon
Contributor

shemnon commented May 8, 2020

Opened #878, but we should see how it behaves around the clock.

@shemnon
Contributor

shemnon commented May 8, 2020

I got debug turned on for a couple of tests - https://app.circleci.com/pipelines/github/hyperledger/besu/3215/workflows/d178944a-b08c-4e4e-8728-38ff5906c259/jobs/16231/artifacts

My take is that peer discovery is failing. It looks like we give bootnodes only one chance. We can either juice the test cluster by adding the bootnodes as static nodes, or add logic to always retry all bootnodes when the peer list is empty.
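For illustration, a minimal sketch of the second option, re-pinging every bootnode whenever the peer table is empty; the interfaces here are hypothetical stand-ins, not Besu's actual discovery API.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for the discovery layer, just to show the shape of the idea.
interface PeerTable {
  int peerCount();
}

interface Bootnode {
  void bond(); // re-send a discovery PING to this bootnode
}

/** Periodically retries every bootnode while the peer table is still empty. */
final class BootnodeRetrier {
  private final PeerTable peers;
  private final List<Bootnode> bootnodes;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  BootnodeRetrier(final PeerTable peers, final List<Bootnode> bootnodes) {
    this.peers = peers;
    this.bootnodes = bootnodes;
  }

  void start() {
    // Every 5 seconds, if discovery has produced no peers yet, ping all bootnodes again
    // so that a single dropped UDP packet cannot strand the node.
    scheduler.scheduleAtFixedRate(
        () -> {
          if (peers.peerCount() == 0) {
            bootnodes.forEach(Bootnode::bond);
          }
        },
        5,
        5,
        TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```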

@shemnon
Contributor

shemnon commented May 8, 2020

Don't forget that peer discovery runs over UDP, and dropping UDP packets arbitrarily is 100% fair game. So even if the Docker container is bad at keeping UDP packets alive, that doesn't necessarily make it the Docker container's fault.

@RatanRSur added the "flake 60% of the time it works 100% of the time." label on May 8, 2020
@RatanRSur
Contributor

from @shemnon :

For the CircleCI issues, what if instead of running acceptance tests in a fleet of docker instances we ran them on one or two bare-metal boxes?

https://circleci.com/docs/2.0/executor-types/#using-machine

Not a 10-minute change, I believe, so budget time accordingly if we do try this (ed.).

@RatanRSur
Contributor

After running acceptance tests back to back for 24 hours on 1.4.2 and master (2020-05-15), we see a huge difference in failure rate. I'm now bisecting to find potential culprits and will update when I have that data.

@hmijail
Contributor

hmijail commented May 19, 2020

Regarding the dropping of UDP packets: Docker seems to consider that a bug (and there are some workarounds). If there is evidence that this is (part of?) the problem, I wouldn't mind digging into it myself.

@RatanRSur
Contributor

There is definitely reason to believe that's part of it. Thank you for offering! Please do dig into that. @shemnon can probably answer any questions you run into, because I haven't looked into that yet.

@timbeiko added this to the Chupacabra Sprint 65 milestone on May 20, 2020
@hmijail
Contributor

hmijail commented May 25, 2020

I reported this in proddev but am repeating it here to keep the info together:

I did some coarse testing for disappearing UDP packets in CircleCI containers by dumping netstat -su at the end of the tests. Only very rarely did I find evidence of a packet being sent but not received, and never in a failed test.
On the other hand, test timeouts seem to cluster by container: if there is a timeout in a container, it looks more likely that subsequent tests will time out in that same container. For example, I just ran the ATs in 8 parallel containers and most of the failed tests happened in 2 of them.
That again points to the possibility of /dev/random exhaustion causing blocking.
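If someone wants to test that theory directly on a suspect container, timing a blocking SecureRandom call is a quick probe. The class below is a throwaway sketch for that purpose, not anything from the Besu code base.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

/** Rough probe for entropy starvation: time how long a blocking SecureRandom takes. */
public class EntropyProbe {
  public static void main(final String[] args) throws NoSuchAlgorithmException {
    // On Linux, getInstanceStrong() typically maps to /dev/random and will block
    // while the kernel entropy pool is exhausted; plain new SecureRandom() normally won't.
    final SecureRandom strong = SecureRandom.getInstanceStrong();
    final byte[] bytes = new byte[32];

    final long start = System.nanoTime();
    strong.nextBytes(bytes);
    final long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    System.out.println("32 random bytes took " + elapsedMs + " ms");
    // A multi-second result on a CI container would support the /dev/random theory;
    // the usual workaround is adding -Djava.security.egd=file:/dev/./urandom to the JVM args.
  }
}
```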
