Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve stability of CI tests #878

Merged
merged 50 commits into from
Jun 1, 2020
Merged

Improve stability of CI tests #878

merged 50 commits into from
Jun 1, 2020

Conversation

shemnon
Copy link
Contributor

@shemnon shemnon commented May 8, 2020

  • Use non-blocking randomness for acceptance tests
  • wait 30 seconds instead of 2 when killing AT processes
  • use java scheduled executor timers instead of vert.x (everywhere, not just ATs).

Signed-off-by: Danno Ferrin danno.ferrin@gmail.com

* Set default nat mode to NONE instead of AUTO
* Ignore the flakey rocksdb unit test.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
@shemnon shemnon mentioned this pull request May 8, 2020
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
* dump memory before shutting down cluster

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
* plumb option to CLI via -X command

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
@RatanRSur RatanRSur added the flake 60% of the time it works 100% of the time. label May 8, 2020
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
@RatanRSur RatanRSur assigned RatanRSur and unassigned RatanRSur May 11, 2020
@timbeiko timbeiko added this to the Chupacabra Sprint 65 milestone May 20, 2020
In both the threaded nodes and the harness set the entroy gathering
device URL to empty string so the default SecureRandom will be DRBG
using non-blocking randomness.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
… circleci

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
shemnon added 12 commits May 27, 2020 11:24
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Oh, yea.  That was me.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
timers.put(
id,
secheduledExecutor.scheduleAtFixedRate(
() -> vertx.executeBlocking(e -> handler.handle(), r -> {}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue to investigate here is whether this will cause threading issues. IIRC, VertexTimerUtil guarantees the timed callbacks happen in the calling thread, and I think we need that as the code is currently written ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the event loop or io thread in this case. Doing cryptography in the event loop or io thread could be a limiter to the number of concurrent peers we can service, especially when it could block on randomness.

While I haven't proven it yet I think this (too much processing, like walking a 4000 entry hash table) is what's causing our eth/65 issues, so getting stuff off of the io thread looks to be a long term goal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. We just need to make sure that the code we're executing with this timer is thread-safe.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Copy link
Contributor

@RatanRSur RatanRSur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flakiness-master failures: 118/25 avg:4.72
refs/pull/878/head failures: 36/14 avg:2.5714285714285716

The burn-in testing on this looks good. When you drill into the logs I attached, you'll see that there's only one kind of failure in this branch and it's due to something @shemnon said he can roll back.
878-burn-in-results.zip

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
@shemnon
Copy link
Contributor Author

shemnon commented Jun 1, 2020

Because it's just a single test that is consistently flakey I @Ignored it and posted #1011 to fix it.

@RatanRSur
Copy link
Contributor

Ok, in that case let's add the "fixes ..." for all the flake labels so they can easily be found in the future.

https://github.com/hyperledger/besu/issues?q=is%3Aopen+is%3Aissue+label%3Aflake

@shemnon shemnon merged commit dac36a5 into hyperledger:master Jun 1, 2020
macfarla pushed a commit to macfarla/besu that referenced this pull request Jun 3, 2020
* Use non-blocking randomness for acceptance tests
  This addresses entropy draining during unit tests.
* wait 30 seconds instead of 2 when killing AT processes
* mark NodeSmartContractPermissioningIbftStallAcceptanceTest as @ignore since it has become reliably and specifically flakey.

Signed-off-by: Danno Ferrin <danno.ferrin@gmail.com>
Signed-off-by: Sally MacFarlane <sally.macfarlane@consensys.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake 60% of the time it works 100% of the time.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants