Description
I tried to run multi node testing (jepsen-cluster and jepsen-cluster-txm workflows in tarantool) and found that it does not work.
There are a lot of warnings of this kind:
WARN [2021-11-14 14:45:49,060] jepsen node 146.185.243.54 - jepsen.control Encountered error with conn [:control "146.185.243.54"]; reopening
java.lang.InterruptedException: sleep interrupted
That finally ends with:
CMake Error at cmake/atomic.cmake:46 (message):
C atomics not supported
This points me to tarantool/tarantool#2088 and, it seems, means that those retries somehow cause the git submodule update --init --recursive command to be skipped and/or the cmake <...> commands to run incompletely.
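One way to check the missed-submodule hypothesis on a failed node is to look at the output of git submodule status --recursive, where git prefixes every uninitialized submodule with "-". The sketch below is only an illustration: the helper name uninitialized_submodules is hypothetical, and it assumes the command is run from the tarantool source directory on the node.

```shell
#!/bin/sh
# Hypothetical helper (not part of tarantool or Jepsen): reads the output of
# `git submodule status --recursive` on stdin and prints the paths of
# submodules that are not initialized. Git marks such entries with a
# leading "-" before the commit hash.
uninitialized_submodules() {
    # Matching lines look like "-<sha1> <path>"; print only the path.
    awk '/^-/ { print $2 }'
}

# Usage on a build node, from the tarantool source directory (assumption):
#   git submodule status --recursive | uninitialized_submodules
```

If this prints any paths after the build step has run, the submodule update did not complete on that node, which would be consistent with the cmake failure above.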
The code that builds tarantool is the same for single node and multi node testing, so my guess is that it is a synchronization problem in the ssh connector implementation. There were relevant fixes in recent Jepsen versions, so we can try to update it and see whether the problem goes away. See #30.
Full logs and artifacts:
- jepsen-cluster-logs.txt and jepsen-cluster.zip.
- jepsen-cluster-txm-logs.txt and jepsen-cluster-txm.zip.
Full logs from successful (single node) testing:
The Tarantool commit on which I ran CI and got those logs.
As I see from tarantool/tarantool#5736, multi node testing was not enabled in order to save machine resources. I think we should enable it anyway, perhaps running it only rarely. Otherwise we'll run into surprises like this one without understanding what actually happened.