Add support for multiple endpoints and failover#1596
Add support for multiple endpoints and failover#1596kragniz merged 26 commits intokragniz:masterfrom
Conversation
Since etcd is a distributed storage, it makes sense for a client library to be able to connect to all the nodes in the cluster, and to manage their current status without needing to be reinstantiated. This patch adds the ability for the client to be aware of the status of all nodes in the cluster, and - in case of failure - to mark each of them as temporarily failed. Individual requests will continue to raise an exception in case of a failed connection at this point, but it would be easy to allow retrying if it seems a good idea. What has been done in practice: * Added a 'Endpoint' class, a simple FSM to abstract a remote server. * Refactored the initialization of the library, adding a switch to allow failing over to another server in case of need. * Refactored self.channel to be a function returning a grpc channel to the first non-failed node, and all the Etcd3Client.*stub properties accordingly. * Refactored the _handle_error wrapper, made it part of the class, and split it between the version for generators and the one for normal returners for code simplicity
This was causing test failures
Based on https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html I believe we should never encounter a channel rendered unusable by a previous error
…sons other than GRPC errors
|
Hello @InvalidInterrupt! I'm really looking forward to seeing this pull request merged! Especially since it's been a subject for a long time it seems. Do you need help on something? I could probably participate. |
|
Hi @Encrypt, and thanks for offering your assistance! I think I'm almost done with the changes to the library itself; the last thing on my to-do list is make the client constructor raise an error if it's passed hostname and port along with an explicit list of endpoints. Otherwise, I just need to write some docs and try to add more tests. If you'd like to help, I think this is close enough to done that an initial code review would be helpful. Since I'm still working on tests, do point out any code you feel absolutely must be covered by a test. |
|
Hello @InvalidInterrupt! Thanks for your answer. I'll probably do a code review tomorrow if I have time! |
…endpoints Perhaps it would be better to move most Etcd3Client behavior into a new, public base named MultiEndpointEtcd3Client or similar, and make Etcd3Client keep the old initializer interface by converting the arguments into an Endpoint before calling super().__init__().
* Add docstrings for Etcd3Client and Endpoint * Add Endpoint autodoc to usage.rst
|
Commenting with an idea expressed in a commit message to make sure it is seen:
|
|
@kragniz Just realized I forgot to tag you |
kragniz
left a comment
There was a problem hiding this comment.
@InvalidInterrupt this is looking good, but I think your idea to create a MultiEndpointEtcd3Client would be useful. The constructor is already quite complicated, so I'd rather have one for each case.
| :type opts: dict, optional | ||
| """ | ||
|
|
||
| def __init__(self, host, port, secure=True, creds=None, time_retry=300.0, |
There was a problem hiding this comment.
300 seconds seems pretty long as a default, is there a particular reason for this length?
There was a problem hiding this comment.
I just kept what the previous author had used here. Since there's currently no built-in provision for jitter or exponential backoff, setting this to be a bit long may help mitigate any "thundering herds" a little. In the event that the user has set up multi-endpoint failover and there are still healthy nodes available, their application won't remain unavailable for those five minutes.
I'm not arguing for any particular value (I also think it could be a bit shorter), but figured I should present an argument for keeping it long.
etcd3/client.py
Outdated
| for response in snapshot_response: | ||
| file_obj.write(response.blob) | ||
|
|
||
| # Remove utility functions from class namespace |
There was a problem hiding this comment.
Let's not delete these - the _ is enough to stop people from depending on them
There was a problem hiding this comment.
My motivation for this change was that becoming a bound method doesn't work well with these functions' signatures (i.e. self will be passed in as payload). Perhaps marking them as staticmethod down here would be a better solution to that problem?
Regardless, I'll remove this logic for now.
… class" This reverts commit 58979ae.
* Turn Etcd3Client in MultiEndpointEtcd3Client, and move existing single-endpoint interface into a subclass retaining the old name. * Revert changes to client() * Autodoc MultiEndpointEtcd3Client
kragniz
left a comment
There was a problem hiding this comment.
This seems good, thanks for your work in getting this finished off
Built upon the work in #106.
This still needs docs. Ideally, I'll also be able to add tests of failover functionality with multiple running etcd services.