Skip to content

Add support for multiple endpoints and failover#1596

Merged
kragniz merged 26 commits intokragniz:masterfrom
InvalidInterrupt:reconnections
Jul 6, 2021
Merged

Add support for multiple endpoints and failover#1596
kragniz merged 26 commits intokragniz:masterfrom
InvalidInterrupt:reconnections

Conversation

@InvalidInterrupt
Copy link
Contributor

Built upon the work in #106.

This still needs docs. Ideally, I'll also be able to add tests of failover functionality with multiple running etcd services.

lavagetto and others added 19 commits April 7, 2018 17:08
Since etcd is a distributed storage, it makes sense for a client library
to be able to connect to all the nodes in the cluster, and to manage
their current status without needing to be reinstantiated.

This patch adds the ability for the client to be aware of the status of
all nodes in the cluster, and - in case of failure - to mark each of
them as temporarily failed.

Individual requests will continue to raise an exception in case of a
failed connection at this point, but it would be easy to allow retrying
if it seems a good idea.

What has been done in practice:

* Added a 'Endpoint' class, a simple FSM to abstract a remote server.
* Refactored the initialization of the library, adding a switch to allow
  failing over to another server in case of need.
* Refactored self.channel to be a function returning a grpc channel to
  the first non-failed node, and all the Etcd3Client.*stub properties
  accordingly.
* Refactored the _handle_error wrapper, made it part of the class, and
  split it between the version for generators and the one for normal
  returners for code simplicity
Based on https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html I believe we should never encounter a channel rendered unusable by a previous error
@Encrypt
Copy link

Encrypt commented Jun 2, 2021

Hello @InvalidInterrupt!

I'm really looking forward to seeing this pull request merged! Especially since it's been a subject for a long time it seems.

Do you need help on something? I could probably participate.

@InvalidInterrupt
Copy link
Contributor Author

Hi @Encrypt, and thanks for offering your assistance!

I think I'm almost done with the changes to the library itself; the last thing on my to-do list is make the client constructor raise an error if it's passed hostname and port along with an explicit list of endpoints. Otherwise, I just need to write some docs and try to add more tests.

If you'd like to help, I think this is close enough to done that an initial code review would be helpful. Since I'm still working on tests, do point out any code you feel absolutely must be covered by a test.

@Encrypt
Copy link

Encrypt commented Jun 3, 2021

Hello @InvalidInterrupt!

Thanks for your answer. I'll probably do a code review tomorrow if I have time!

…endpoints

Perhaps it would be better to move most  Etcd3Client behavior into a new, public base named MultiEndpointEtcd3Client or similar, and make Etcd3Client keep the old initializer interface by converting the arguments into an Endpoint before calling super().__init__().
* Add docstrings for Etcd3Client and Endpoint
* Add Endpoint autodoc to usage.rst
@InvalidInterrupt InvalidInterrupt marked this pull request as ready for review June 4, 2021 20:18
@InvalidInterrupt
Copy link
Contributor Author

Commenting with an idea expressed in a commit message to make sure it is seen:

Perhaps it would be better to move most Etcd3Client behavior into a new, public base named MultiEndpointEtcd3Client or similar, and make Etcd3Client keep the old initializer interface by converting the arguments into an Endpoint before calling super().__init__().

@InvalidInterrupt
Copy link
Contributor Author

@kragniz Just realized I forgot to tag you

Copy link
Owner

@kragniz kragniz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@InvalidInterrupt this is looking good, but I think your idea to create a MultiEndpointEtcd3Client would be useful. The constructor is already quite complicated, so I'd rather have one for each case.

:type opts: dict, optional
"""

def __init__(self, host, port, secure=True, creds=None, time_retry=300.0,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

300 seconds seems pretty long as a default, is there a particular reason for this length?

Copy link
Contributor Author

@InvalidInterrupt InvalidInterrupt Jun 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just kept what the previous author had used here. Since there's currently no built-in provision for jitter or exponential backoff, setting this to be a bit long may help mitigate any "thundering herds" a little. In the event that the user has set up multi-endpoint failover and there are still healthy nodes available, their application won't remain unavailable for those five minutes.

I'm not arguing for any particular value (I also think it could be a bit shorter), but figured I should present an argument for keeping it long.

etcd3/client.py Outdated
for response in snapshot_response:
file_obj.write(response.blob)

# Remove utility functions from class namespace
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not delete these - the _ is enough to stop people from depending on them

Copy link
Contributor Author

@InvalidInterrupt InvalidInterrupt Jun 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation for this change was that becoming a bound method doesn't work well with these functions' signatures (i.e. self will be passed in as payload). Perhaps marking them as staticmethod down here would be a better solution to that problem?

Regardless, I'll remove this logic for now.

* Turn Etcd3Client in MultiEndpointEtcd3Client,
  and move existing single-endpoint interface
  into a subclass retaining the old name.

* Revert changes to client()

* Autodoc MultiEndpointEtcd3Client
@InvalidInterrupt InvalidInterrupt requested a review from kragniz July 3, 2021 19:07
Copy link
Owner

@kragniz kragniz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good, thanks for your work in getting this finished off

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants