
Resolve test flakiness #864

Closed
MadelineMurray opened this issue May 7, 2020 · 17 comments
Labels: flake 60% of the time it works 100% of the time.

Comments

@MadelineMurray
Contributor

There's been a significant increase in the number and frequency of acceptance test (AT) failures.

This looks to have started around April 22, and a couple of PRs have been made to try to fix it:

@lucassaldanha
Member

lucassaldanha commented May 7, 2020

Lucas's crazy tests

So far I've tested a few things:

Running tests on a dedicated server

The idea was to isolate any CircleCI-related issue. Eventually, we still had failures due to the 30s timeout.

Running the tests one at a time

The idea was to isolate any resource bottleneck. But eventually, we still had tests failing due to the 30s timeout.

Running tests with an increased timeout (240 seconds)

The idea was to check if the problem was related to the slow start of the nodes. Still, we had tests failing due to timeout.

Running tests on threads instead of independent java processes

The idea was to check if the problem was related to how we run the nodes as Java processes, so I ran them in-process using the flag acctests.runBesuAsProcess=false. Eventually, we had failures due to timeout (30s).
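For reference, here is a minimal sketch of what that switch amounts to, assuming the acceptance-test framework picks a node runner based on the acctests.runBesuAsProcess system property. Apart from ProcessBesuNodeRunner, which shows up in the test logs, the names below are illustrative rather than Besu's actual classes.

```java
// Illustrative sketch: choose between forking each node as a separate Java process
// or running it on a thread inside the test JVM, driven by a system property.
interface BesuNodeRunner {
  void startNode(String nodeName);
}

class ProcessBesuNodeRunner implements BesuNodeRunner {
  @Override
  public void startNode(final String nodeName) {
    // Fork build/install/besu/bin/besu with the node's CLI arguments.
  }
}

class ThreadBesuNodeRunner implements BesuNodeRunner {
  @Override
  public void startNode(final String nodeName) {
    // Build and start the node in-process on a dedicated thread.
  }
}

final class BesuNodeRunners {
  static BesuNodeRunner instance() {
    // -Dacctests.runBesuAsProcess=false selects the in-process (thread) runner;
    // leaving it unset (or true) keeps the default separate-process behaviour.
    final boolean asProcess =
        Boolean.parseBoolean(System.getProperty("acctests.runBesuAsProcess", "true"));
    return asProcess ? new ProcessBesuNodeRunner() : new ThreadBesuNodeRunner();
  }
}
```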

Conclusion (pending)

So whatever we have here doesn't seem to be related to hardware power or a timeout period that is too short. I suspect it might have something to do with the way we are starting the nodes or using resources. But so far I haven't managed to isolate a single variable that causes the tests to fail.

@AbdelStark
Contributor

AbdelStark commented May 7, 2020

I am checking whether the problem is related to EIP-1559 in #865.

Results on pull request #865

  • Attempt 1: CI OK
  • Attempt 2: CI KO, failed at the unitTests step (CircleCI link); the acceptanceTests were OK
  • Attempt 3: CI OK
  • Attempt 4: CI OK
  • Attempt 5: CI OK
  • Attempt 6: CI OK

Results on the current master branch

  • Attempt 1: CI OK
  • Attempt 2: CI OK
  • Attempt 3: CI OK
  • Attempt 4: CI OK
  • Attempt 5: CI OK
  • Attempt 6: CI OK

@MadelineMurray
Contributor Author

@abdelhamidbakhta - there was a RocksDB unit test failure earlier today as well - https://circleci.com/gh/hyperledger/besu/15791

@benjamincburns

benjamincburns commented May 7, 2020

@lucassaldanha @abdelhamidbakhta (cc @MadelineMurray) - you might try git bisect locally with multiple test runs on each checkout to see if you can isolate a commit that made things break

@benjamincburns

If you aren't familiar with git bisect, I made a sort of tutorial video a while back: https://www.youtube.com/watch?v=QmBk_eYGGjE

@hmijail
Contributor

hmijail commented May 8, 2020

@lucassaldanha, just saw this so it might be too late, but did you check for OOM messages in the kernel logs when you tested on your own server?

@shemnon
Contributor

shemnon commented May 8, 2020

Do we need to explicitly turn off NAT? I see the tests are auto-detecting Docker NAT, but unless we are testing NAT, that seems like an unnecessary variable.

node1 | 22:54:14.452 | Test worker | INFO | ProcessBesuNodeRunner | Creating besu process with params [build/install/besu/bin/besu, --data-path, /tmp/acctest4993954506636015820, --network, DEV, --discovery-enabled, true, --p2p-host, 127.0.0.1, --p2p-port, 0, --bootnodes, --rpc-http-enabled, --rpc-http-host, 127.0.0.1, --rpc-http-port, 0, --rpc-http-api, ETH,NET,WEB3, --rpc-http-authentication-enabled, --rpc-http-authentication-credentials-file, /home/circleci/project/acceptance-tests/tests/build/resources/test/authentication/auth.toml, --rpc-ws-enabled, --rpc-ws-host, 127.0.0.1, --rpc-ws-port, 0, --rpc-ws-api, ETH,NET,WEB3, --rpc-ws-authentication-enabled, --rpc-ws-authentication-credentials-file, /home/circleci/project/acceptance-tests/tests/build/resources/test/authentication/auth.toml, --Xp2p-check-maintained-connections-frequency, 60, --Xp2p-initiate-connections-frequency, 5, --Xsecp256k1-native-enabled=false, --Xaltbn128-native-enabled=false, --key-value-storage, rocksdb, --auto-log-bloom-caching-enabled, false]
[snip]
node1 | 2020-05-07 22:54:16.321+00:00 | main | INFO | ProtocolScheduleBuilder | Protocol schedule created with milestones: [ConstantinopleFix: 0]
node1 | 2020-05-07 22:54:16.415+00:00 | main | INFO | RunnerBuilder | Detecting NAT service.
node1 | 2020-05-07 22:54:16.930+00:00 | main | INFO | Runner | Starting Ethereum main loop ...
node1 | 2020-05-07 22:54:16.930+00:00 | main | INFO | DockerNatManager | Starting docker NAT manager.
node1 | 2020-05-07 22:54:16.931+00:00 | main | INFO | NetworkRunner | Starting Network.

build artifact

@lucassaldanha
Member

> @lucassaldanha, just saw this so it might be too late, but did you check for OOM messages in the kernel logs when you tested on your own server?

No OOM errors.

@lucassaldanha
Member

> Do we need to explicitly turn off NAT? I see the tests are auto-detecting Docker NAT, but unless we are testing NAT, that seems like an unnecessary variable.

It might be worth trying it.

@shemnon
Contributor

shemnon commented May 8, 2020

Opened #878, but we should see how it behaves around the clock.

@shemnon
Contributor

shemnon commented May 8, 2020

I got debug turned on for a couple of tests - https://app.circleci.com/pipelines/github/hyperledger/besu/3215/workflows/d178944a-b08c-4e4e-8728-38ff5906c259/jobs/16231/artifacts

My take is that peer discovery is failing. It looks like we give bootnodes only one chance. We can either juice the test cluster by adding the bootnodes as static nodes, or add logic to always retry all bootnodes when the peer list is empty.
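For illustration, a minimal sketch of the second option, re-pinging every bootnode whenever the peer table is empty; the interfaces here are hypothetical stand-ins, not Besu's actual discovery API.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for the discovery layer, just to show the shape of the idea.
interface PeerTable {
  int peerCount();
}

interface Bootnode {
  void bond(); // re-send a discovery PING to this bootnode
}

/** Periodically retries every bootnode while the peer table is still empty. */
final class BootnodeRetrier {
  private final PeerTable peers;
  private final List<Bootnode> bootnodes;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  BootnodeRetrier(final PeerTable peers, final List<Bootnode> bootnodes) {
    this.peers = peers;
    this.bootnodes = bootnodes;
  }

  void start() {
    // Every 5 seconds, if discovery has produced no peers yet, ping all bootnodes again
    // so that a single dropped UDP packet cannot strand the node.
    scheduler.scheduleAtFixedRate(
        () -> {
          if (peers.peerCount() == 0) {
            bootnodes.forEach(Bootnode::bond);
          }
        },
        5,
        5,
        TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```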

@shemnon
Contributor

shemnon commented May 8, 2020

Don't forget that peer discovery runs over UDP, and dropping UDP packets arbitrarily is 100% fair game. So even if the Docker container is bad at keeping UDP packets alive, that doesn't necessarily make it the Docker container's fault.

@RatanRSur added the "flake 60% of the time it works 100% of the time." label on May 8, 2020
@RatanRSur
Contributor

from @shemnon :

For the CircleCI issues, what if instead of running acceptance tests in a fleet of docker instances we ran them on one or two bare-metal boxes?

https://circleci.com/docs/2.0/executor-types/#using-machine

Not a 10-minute change, I believe, so budget time accordingly if we do try this (ed.).

@RatanRSur
Contributor

After running acceptance tests back to back for 24 hours on 1.4.2 and master (2020-05-15), we see a huge difference in failure rate. I'm now bisecting to find potential culprits and will update when I have that data.

@hmijail
Contributor

hmijail commented May 19, 2020

Regarding the dropping of UDP packets: Docker seems to consider that a bug (and there are some workarounds). If there is evidence that this is (part of?) the problem, I wouldn't mind digging into it myself.

@RatanRSur
Contributor

There is definitely reason to believe that's part of it. Thank you for offering! Please do dig into that. @shemnon can probably answer any questions you run into, because I haven't looked into that yet.

@timbeiko added this to the Chupacabra Sprint 65 milestone on May 20, 2020
@hmijail
Contributor

hmijail commented May 25, 2020

I reported this in proddev but am repeating it here to keep the info together:

I did some coarse testing for disappearing UDP packets in CircleCI containers by dumping netstat -su at the end of the tests. Only very rarely did I find evidence of a packet being sent but not received, and never in a failed test.
On the other hand, test timeouts seem to cluster by container: if there is a timeout in a container, it looks more likely that subsequent tests will time out in that same container. For example, I just ran the ATs in 8 parallel containers and most of the failed tests happened in 2 of them.
That again points to the possibility of /dev/random exhaustion causing blocking.
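If someone wants to test that theory directly on a suspect container, timing a blocking SecureRandom call is a quick probe. The class below is a throwaway sketch for that purpose, not anything from the Besu code base.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

/** Rough probe for entropy starvation: time how long a blocking SecureRandom takes. */
public class EntropyProbe {
  public static void main(final String[] args) throws NoSuchAlgorithmException {
    // On Linux, getInstanceStrong() typically maps to /dev/random and will block
    // while the kernel entropy pool is exhausted; plain new SecureRandom() normally won't.
    final SecureRandom strong = SecureRandom.getInstanceStrong();
    final byte[] bytes = new byte[32];

    final long start = System.nanoTime();
    strong.nextBytes(bytes);
    final long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    System.out.println("32 random bytes took " + elapsedMs + " ms");
    // A multi-second result on a CI container would support the /dev/random theory;
    // the usual workaround is adding -Djava.security.egd=file:/dev/./urandom to the JVM args.
  }
}
```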
