Description
We changed the meaning of the `retryCount` configuration option in the scope of #167 (it lands in the 1.9.2 release). In 1.9.1 it means the overall number of connection attempts. In 1.9.2 it means the number of attempts to connect to one instance.
Now the cluster client tries to connect to one instance `retryCount` times, then tries to connect to the other instances. If it tries all instances without success, the client dies (goes into the CLOSED state).
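The 1.9.2 attempt order described above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the connector's actual code: the function name `connect_1_9_2` and the `try_connect` callback are hypothetical, and timeouts are left out.

```python
from typing import Callable, List, Optional


def connect_1_9_2(instances: List[str],
                  retry_count: int,
                  try_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the first instance that accepts a connection.

    Returns None when every instance has been tried retry_count times
    without success, which corresponds to the client going into CLOSED.
    """
    for instance in instances:
        # Exhaust all attempts on one instance before moving to the next.
        for _ in range(retry_count):
            if try_connect(instance):
                return instance
    return None


# Record the order of attempts against a cluster where every instance is down.
attempts: List[str] = []


def always_fail(instance: str) -> bool:
    attempts.append(instance)
    return False


result = connect_1_9_2(["a", "b"], retry_count=2, try_connect=always_fail)
```

With two unreachable instances and `retry_count=2`, the client makes exactly one pass over the cluster and then gives up.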
It is likely that a user will set considerably small `connectionTimeout` and `retryCount` values to reconnect to another instance sooner if there is a problem with the current one. However, if there is a need to survive significant downtime or connectivity problems while keeping the ability to switch quickly away from an instance that has a local problem, we need to change the algorithm somehow.
I have two possible variants:
- Add a configuration option that sets the number of connection cycles over the whole cluster (now it is always 1).
- Change the order of connection attempts: try to connect to the first instance once, then to the next one, and so on in a loop until each instance has been tried `retryCount` times.
I am not sure it is good to change the order of attempts, because it is user-visible behaviour, so I lean towards the first variant.
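To make the difference between the two variants concrete, here is a small sketch in Python that only generates the order in which instances would be tried. The names `order_with_cycles`, `order_interleaved` and the `cycle_count` parameter are hypothetical; no such option exists yet.

```python
from typing import Iterator, List


def order_with_cycles(instances: List[str],
                      retry_count: int,
                      cycle_count: int) -> Iterator[str]:
    """Variant 1: repeat the current per-instance order cycle_count times."""
    for _ in range(cycle_count):
        for instance in instances:
            for _ in range(retry_count):
                yield instance


def order_interleaved(instances: List[str],
                      retry_count: int) -> Iterator[str]:
    """Variant 2: one attempt per instance, looping retry_count times."""
    for _ in range(retry_count):
        for instance in instances:
            yield instance


cycles = list(order_with_cycles(["a", "b"], retry_count=2, cycle_count=2))
interleaved = list(order_interleaved(["a", "b"], retry_count=2))
```

In this sketch, the first variant with `cycle_count=1` reproduces the current order, so existing configurations would behave as before. The second variant changes the order even for existing configurations.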
Several side notes.
This problem can be overcome on the user side; however, we have no ability to reconnect a dead client (see #229), so a user will need to re-create the client. It would be good to handle this on our side to eliminate the need for extra logic on the user side.
Re-creating a client can lead to an inability to connect if a user bootstraps the client from one instance and uses cluster discovery to fetch the others, and the need to re-create the client occurs during a problem with the initial instance (or if the initial configuration was not updated in time). I think it is worth exposing the last cluster discovery result to give a user the ability to handle this problem if they want to re-create the cluster client anyway (issue TBD).