Improve stability of CI tests #878

shemnon · 2020-05-08T05:20:50Z

* Use non-blocking randomness for acceptance tests.
* Wait 30 seconds instead of 2 when killing AT processes.
* Use Java scheduled executor timers instead of Vert.x timers (everywhere, not just ATs).

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
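One of the commits in this PR describes setting the entropy gathering device URL to an empty string so that the default SecureRandom becomes a DRBG that does not block on system entropy. A minimal sketch of that idea follows. It assumes the JDK security property `securerandom.source` is the setting in question; the PR text does not show the exact property or where it is set, and the class and method names here are hypothetical. The sketch prints the resulting algorithm rather than assuming it, since the outcome depends on the JDK version and on whether a SecureRandom was already created.

```java
import java.security.SecureRandom;
import java.security.Security;

final class NonBlockingRandomSketch {

  /**
   * Clears the JDK's configured seed source. This only influences provider
   * selection if it runs before the first SecureRandom is created in the JVM.
   * Assumption: "securerandom.source" is the property the PR refers to.
   */
  static void clearSeedSource() {
    Security.setProperty("securerandom.source", "");
  }

  /** Draws the requested number of bytes from the default SecureRandom. */
  static byte[] randomBytes(final int length) {
    final byte[] bytes = new byte[length];
    new SecureRandom().nextBytes(bytes);
    return bytes;
  }

  public static void main(final String[] args) {
    clearSeedSource();
    final SecureRandom random = new SecureRandom();
    // Print the algorithm actually selected instead of hard-coding an expectation.
    System.out.println("default algorithm: " + random.getAlgorithm());
    System.out.println("bytes generated: " + randomBytes(32).length);
  }
}
```

For a forked node process the same setting would have to reach the child JVM, for example through a security properties file or JVM options; how this PR does that is not shown in the conversation.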

...c/main/java/org/hyperledger/besu/ethereum/p2p/discovery/internal/IndirectVertxTimerUtil.java

mbaxter · 2020-05-28T13:38:20Z

+    timers.put(
+        id,
+        secheduledExecutor.scheduleAtFixedRate(
+            () -> vertx.executeBlocking(e -> handler.handle(), r -> {}),


One issue to investigate here is whether this will cause threading issues. IIRC, VertxTimerUtil guarantees that the timed callbacks happen in the calling thread, and I think we need that as the code is currently written ...

shemnon · 2020-05-28

That would be the event loop or io thread in this case. Doing cryptography in the event loop or io thread could limit the number of concurrent peers we can service, especially when it could block on randomness.

While I haven't proven it yet, I think this (too much processing, such as walking a 4000-entry hash table) is what's causing our eth/65 issues, so getting work off of the io thread looks to be a long-term goal.

mbaxter · 2020-05-28

That makes sense. We just need to make sure that the code we're executing with this timer is thread-safe.
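To make the thread-safety concern in this thread concrete, here is a rough sketch of a timer utility backed by a `ScheduledExecutorService`. It is not the `IndirectVertxTimerUtil` from the diff: the class name and method names are hypothetical, and the Vert.x hand-off (`vertx.executeBlocking`) is left out so the example only needs the JDK. Callbacks run on the executor's thread, not on the thread that registered them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class ExecutorTimerSketch {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();
  // Concurrent map and atomic counter so timers can be registered from any thread.
  private final Map<Long, ScheduledFuture<?>> timers = new ConcurrentHashMap<>();
  private final AtomicLong nextId = new AtomicLong();

  /** Schedules handler every periodMillis and returns an id usable with cancelTimer. */
  long setPeriodic(final long periodMillis, final Runnable handler) {
    final long id = nextId.incrementAndGet();
    timers.put(
        id,
        executor.scheduleAtFixedRate(
            handler, periodMillis, periodMillis, TimeUnit.MILLISECONDS));
    return id;
  }

  /** Cancels a timer; returns false if the id is unknown or already cancelled. */
  boolean cancelTimer(final long id) {
    final ScheduledFuture<?> timer = timers.remove(id);
    return timer != null && timer.cancel(false);
  }

  void shutdown() {
    executor.shutdownNow();
  }

  /** Runs a 10 ms timer and reports whether it fired `ticks` times within 5 s and was then cancelled. */
  static boolean demo(final int ticks) {
    final ExecutorTimerSketch util = new ExecutorTimerSketch();
    final CountDownLatch latch = new CountDownLatch(ticks);
    final long id = util.setPeriodic(10, latch::countDown);
    try {
      return latch.await(5, TimeUnit.SECONDS) && util.cancelTimer(id);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      util.shutdown();
    }
  }

  public static void main(final String[] args) {
    System.out.println("timer fired and was cancelled: " + demo(3));
  }
}
```

A single-threaded executor serializes the callbacks with respect to each other, but they still run concurrently with whatever the event loop is doing, so any state shared between the two needs to be thread-safe.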

RatanRSur

flakiness-master failures: 118/25 avg:4.72
refs/pull/878/head failures: 36/14 avg:2.5714285714285716

The burn-in testing on this looks good. When you drill into the logs I attached, you'll see that there's only one kind of failure in this branch and it's due to something @shemnon said he can roll back.
878-burn-in-results.zip
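The two averages quoted above can be reproduced with a few lines of arithmetic. This sketch assumes each `a/b` pair means total failures over number of burn-in runs, which matches the printed averages but is not stated explicitly in the comment; the class and method names are hypothetical.

```java
final class FlakinessSketch {

  /** Average number of failures per run. */
  static double average(final int failures, final int runs) {
    if (runs <= 0) {
      throw new IllegalArgumentException("runs must be positive");
    }
    return (double) failures / runs;
  }

  /** Fraction by which the candidate average is lower than the baseline average. */
  static double relativeReduction(final double baseline, final double candidate) {
    return 1.0 - candidate / baseline;
  }

  public static void main(final String[] args) {
    final double master = average(118, 25); // flakiness-master
    final double branch = average(36, 14); // refs/pull/878/head
    System.out.println("master avg: " + master);
    System.out.println("branch avg: " + branch);
    System.out.println("relative reduction: " + relativeReduction(master, branch));
  }
}
```

Under that reading, the branch averages a little under half as many failures per run as master did.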

shemnon · 2020-06-01T14:53:51Z

Because it's just a single test that is consistently flakey, I @Ignored it and posted #1011 to fix it.

RatanRSur · 2020-06-01T15:08:58Z

Ok, in that case let's add the "fixes ..." for all the flake labels so they can easily be found in the future.

https://github.com/hyperledger/besu/issues?q=is%3Aopen+is%3Aissue+label%3Aflake

* Use non-blocking randomness for acceptance tests. This addresses entropy draining during unit tests.
* wait 30 seconds instead of 2 when killing AT processes
* mark NodeSmartContractPermissioningIbftStallAcceptanceTest as @ignore since it has become reliably and specifically flakey.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Sally MacFarlane <sally.macfarlane@consensys.net>
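The longer wait when killing acceptance-test processes, mentioned in the commit message above, can be sketched with the standard `Process` API. This is an illustration under assumptions, not the harness code from this PR: the class, the `stop` helper, and the demo that launches `java -version` are all hypothetical, and only the 30-second grace period comes from the PR text.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

final class ProcessShutdownSketch {

  /**
   * Requests termination, waits up to graceSeconds for the process to exit,
   * and force-kills it if it is still running afterwards.
   * Returns true if the process exited within the grace period.
   */
  static boolean stop(final Process process, final long graceSeconds) {
    process.destroy(); // termination request; exact behaviour is platform dependent
    try {
      if (process.waitFor(graceSeconds, TimeUnit.SECONDS)) {
        return true;
      }
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    process.destroyForcibly(); // last resort once the grace period has elapsed
    return false;
  }

  /** Launches a short-lived child JVM and stops it with a 30-second grace period. */
  static boolean demo() {
    final String java = System.getProperty("java.home") + "/bin/java";
    try {
      final Process child =
          new ProcessBuilder(java, "-version")
              .redirectErrorStream(true)
              .redirectOutput(ProcessBuilder.Redirect.DISCARD)
              .start();
      return stop(child, 30);
    } catch (final IOException e) {
      return false;
    }
  }

  public static void main(final String[] args) {
    System.out.println("child exited within grace period: " + demo());
  }
}
```

A short grace period such as 2 seconds can force-kill a node that is still flushing state on a loaded CI machine; a longer one trades a slower worst case for fewer abrupt kills.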

Improve stability of CI tests

8086639

* Set default nat mode to NONE instead of AUTO
* Ignore the flakey rocksdb unit test.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

shemnon mentioned this pull request May 8, 2020

Resolve test flakiness #864

Closed

shemnon added 9 commits May 8, 2020 09:52

debug logging

5f5ef74

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

adjust log level in circleci config instead

d842135

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

* bump circleci to large

cddc48f

* dump memory before shutting down cluster

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

undo accidental commit inclusion

34df609

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

* Reduce table refresh interval for acceptance tests from 30->5 sec

ef629ce

* plumb option to CLI via -X command

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

Merge branch 'master' of github.com:hyperledger/besu into circleci

960504d

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

fix build breakages

f03ee75

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

spotless

7af3e8c

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

don't have bootnodes refresh as often, make refresh just past round time

719c40a

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

RatanRSur added the flake label ("60% of the time it works 100% of the time.") May 8, 2020

shemnon added 8 commits May 8, 2020 17:03

move where we set refresh

f9f11b3

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

update jdk versions

3d2cac9

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

wrong toy story character

8dbf9c5

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

libsodium update

e2f8c5a

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

prefer endpoint port over udp port.

baa17aa

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

fix units error

0dcb736

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

spotless

6404c59

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

take vertx down to 0

9a1bb46

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

RatanRSur assigned RatanRSur and unassigned RatanRSur May 11, 2020

timbeiko added this to the Chupacabra Sprint 65 milestone May 20, 2020

shemnon added 7 commits May 23, 2020 13:33

Acceptance Tests non-blocking PRNG

ab821e3

In both the threaded nodes and the harness, set the entropy gathering device URL to an empty string so the default SecureRandom will be DRBG using non-blocking randomness.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

Merge branch 'acceptanceNoNativePRNG' of github.com:shemnon/besu into circleci

67a0536

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

log cluster completed

dc8a9da

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

noisy secure random stats

22b8684

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

undo ignore test

365c4d1

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

log discovery packets sent

31b847e

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

make securerandom provider quieter.

25f62b1

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

shemnon added 12 commits May 27, 2020 11:24

more devp2p logging

84e466b

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

spotless

94bdbb1

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

not all packets are pings...

45af203

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

lotso tracing

26348f3

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

log timer calls

1f46d89

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

who put that really annoying errorprone check in?

463068c

Oh, yea. That was me.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

another log.

bcd7904

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

and log stacktrace

9c6a3f2

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

vertx.io rollback

a9c616d

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

make it fail

91cd6ba

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

make it fail

ce9e619

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

fine, we'll make our own timers

8777ab1

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

shemnon force-pushed the circleci branch from acbe858 to 8777ab1 Compare May 28, 2020 05:19

shemnon added 2 commits May 27, 2020 23:40

remove excessive debugging

574eaee

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

undo code cleanup, so that elsewhere

77e6224

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

lucassaldanha reviewed May 28, 2020

...c/main/java/org/hyperledger/besu/ethereum/p2p/discovery/internal/IndirectVertxTimerUtil.java (outdated)

mbaxter reviewed May 28, 2020

RatanRSur reviewed May 28, 2020

...c/main/java/org/hyperledger/besu/ethereum/p2p/discovery/internal/IndirectVertxTimerUtil.java (outdated)

shemnon added 4 commits May 28, 2020 09:31

fix tpyos

0782b64

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

Merge branch 'master' into circleci

028ea77

undo indirect timer

9f44951

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

Merge branch 'master' into circleci

5b6c644

RatanRSur approved these changes Jun 1, 2020

ignore newly flakey test

7d2a812

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>

shemnon merged commit dac36a5 into hyperledger:master Jun 1, 2020
