Description
Today when configuring nodes for tests we use a variety of mechanisms to ensure that they can find each other and form a cluster. At least one of these mechanisms is racy, leading to occasional test failures (#29244), and few of them implement our documented recommendation:
that the unicast hosts list be maintained as the list of master-eligible nodes in the cluster.
The main obstacle to fixing this was the difficulty of dynamically configuring the unicast hosts list in tests.
- Move file-based discovery into core Elasticsearch (Move file-based discovery to core #33241)
- Stop using `MockUncasedHostProvider` in most `ESIntegTestCase`s (Use file-based discovery not MockUncasedHostsProvider #33554, Use file-based discovery not MockUncasedHostsProvider (backport of #33554) #33658)
- Use file-based discovery for forming clusters in `AbstractDisruptionTestCase` (Disc: Move AbstractDisruptionTC to filebased D. #34461)
- Tidy up settings management in `AbstractDisruptionTestCase`, overriding `nodeSettings` rather than making a separate `NodeConfigurationSource`, and using `ESIntegTestCase` to manage `minimum_master_nodes` etc. (DISCOVERY: Cleanup AbstractDisruptionTestCase #34808)
- Use file-based discovery for forming clusters of real nodes in REST tests (TESTS: Use File Based Discovery in REST Tests #34560)
- Prevent tests in 5.6 from binding to port 30210 (see below) (DISCOVERY: Exclude Port 30210 in 5.6 Branch #34955)
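To illustrate why file-based discovery makes the hosts list easy to configure dynamically, here is a minimal sketch of a test helper that rewrites a `unicast_hosts.txt` file with the current master-eligible addresses, one `host:port` per line. The class name, method name, and directory layout are hypothetical and not taken from the Elasticsearch test framework; the nodes themselves would also need file-based discovery enabled (in 6.x this is believed to be `discovery.zen.hosts_provider: file`, which should be checked against the docs for the version in use).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, not part of the Elasticsearch test framework.
public final class UnicastHostsFileSketch {
    // File name used by file-based discovery; the directory it must live in
    // depends on the Elasticsearch version and is an assumption of the caller.
    static final String HOSTS_FILE = "unicast_hosts.txt";

    /** Writes one host:port entry per line, replacing any previous list. */
    static Path writeHosts(Path configDir, List<String> masterEligibleAddresses) throws IOException {
        Files.createDirectories(configDir);
        // Write to a temporary file first so a node never reads a half-written list.
        Path tmp = Files.createTempFile(configDir, HOSTS_FILE, ".tmp");
        Files.write(tmp, masterEligibleAddresses, StandardCharsets.UTF_8);
        Path target = configDir.resolve(HOSTS_FILE);
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

A test could call `writeHosts` each time a master-eligible node starts or stops, so that every node's hosts list stays in line with the documented recommendation.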
In the REST tests, we currently start one seed node and list it as the single host in every other node's unicast hosts list. This means that the cluster can sometimes form without the seed node knowing about it, after which the seed node cannot ping any of the other nodes in order to join. Unfortunately these tests include BWC ones, so we can only reasonably make this change in `master` once 6.5 is released; otherwise we would be trying to configure nodes that do not natively support file-based discovery.
The racy one is `AbstractDisruptionTestCase`, which attempts to find a free port by binding to it, releasing it, and then telling the node to use it; for some reason port 30210 is occasionally in use by something else by the time the node binds. It really does seem to be just this port that causes issues. We don't want to backport all this work to 5.6, but we do want to avoid test failures like this there, so let's just avoid this particular port in that branch.
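The bind-then-release pattern and the proposed exclusion can be sketched roughly as follows. This is not the actual test code: the class and method names are hypothetical, and it asks the OS for an ephemeral port rather than scanning whatever range the real test uses.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical sketch of the racy "probe for a free port" pattern.
public final class FreePortSketch {
    // The port that the 5.6 branch would skip.
    static final int EXCLUDED_PORT = 30210;

    /** Returns a port that was free when probed and is not the excluded one. */
    static int findFreePort(int excludedPort) throws IOException {
        while (true) {
            // Port 0 asks the OS to choose any currently free port.
            try (ServerSocket socket = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
                int port = socket.getLocalPort();
                if (port != excludedPort) {
                    // The socket is closed as we return, so another process can
                    // still claim this port before the node binds to it.
                    return port;
                }
            }
        }
    }
}
```

Skipping one port removes a known source of failures but does not close the window between releasing the port and the node binding to it, which is why the longer-term fix is to stop depending on pre-chosen ports.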