Add ability to pass multiple endpoints, failover capability #106

Merged
kragniz merged 2 commits into kragniz:master from lavagetto:reconnections
Jul 6, 2021

Conversation

@lavagetto
Contributor

@lavagetto lavagetto commented Apr 2, 2017

These two commits add two features:

  • Ability to define a list of endpoints, pass them to the client library, have it pick one at random
  • Upon a connection failure, one server is marked as temporarily down and subsequent requests are made to non-failed servers. This allows, for example, a service to keep running without restarts during a maintenance window in which one etcd server is rebooted or its etcd service is restarted.

This is still a WiP (as you might have noticed, no docs and no additional tests yet; sorry, but I can really only work on this on Sundays :/), but I wanted to get some feedback on the direction I've taken with the code, given that it is a non-negligible overhaul.

The way I built the reconnection logic might seem like a bit of overkill, but my experience trying to build a simpler model for the etcd2 Python library taught me otherwise. I tried to learn from that experience.
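
To make the intended behavior concrete, here is a minimal usage sketch in Python. The endpoints and failover parameter names follow the discussion in this PR, but the exact signature is an assumption, not the final API:

import etcd3

# Hypothetical call shape based on this PR's discussion; not the
# released API.
client = etcd3.client(
    endpoints=['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379'],
    failover=True,
)

# If the endpoint in use stops responding, it is marked as temporarily
# failed and subsequent requests are routed to a non-failed endpoint.
value, metadata = client.get('/config/some-key')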

@codecov-io

codecov-io commented Apr 2, 2017

Codecov Report

Merging #106 into master will increase coverage by 0.26%.
The diff coverage is 95.68%.

@@            Coverage Diff             @@
##           master     #106      +/-   ##
==========================================
+ Coverage   92.86%   93.13%   +0.26%     
==========================================
  Files          10       10              
  Lines         617      714      +97     
==========================================
+ Hits          573      665      +92     
- Misses         44       49       +5
Impacted Files Coverage Δ
etcd3/__init__.py 100% <100%> (ø) ⬆️
etcd3/exceptions.py 100% <100%> (ø) ⬆️
etcd3/client.py 93.7% <95.58%> (+0.27%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27f7335...df3bcac.

@kragniz
Owner

kragniz commented Apr 3, 2017

Hey, I've only skimmed over your patch, but it looks great so far!

@lavagetto
Contributor Author

@kragniz as expected, I had a little time to work on this on Sunday, and I found a few small issues with my code that I am still fixing. It might take another week or two (next Sunday is Easter here).

@lavagetto
Contributor Author

I'll look into improving the test coverage (I think the whole SRV DNS part is still untested), but I consider the patch to be in a reviewable state now.

@jd
Collaborator

jd commented Jun 16, 2017

Does this cover #150 also?
I'll try to review this patch ASAP.

@lavagetto
Contributor Author

lavagetto commented Jun 16, 2017 via email

@lavagetto lavagetto changed the title from "[WiP] Add SRV record support, reconnection capability" to "Add SRV record support, reconnection capability" on Jun 18, 2017
@lavagetto
Contributor Author

I think this patch is ready to be reviewed/merged, but please let me know if you want me to improve something.

Collaborator

@jd jd left a comment


Here's a bunch of review comments. I feel bad that this is only one patch when there are actually 3 different pieces:

  1. Handle jumping between multiple servers

  2. Doing an SRV request to gather the initial server(s)

  3. Read the cluster topology to jump between servers

It would have been much better to do 3 different incremental patches.

Thank you for doing that anyway! That looks promising.

self.host = host
self.netloc = "{host}:{port}".format(host=host, port=port)
self.secure = secure
self.protocol = 'https' if secure else 'http'
Collaborator

why http/https? there's no http here as far as I can see.

Contributor Author

That's how etcd itself refers to its endpoints in its configuration. I checked, and in fact there is still no official URI scheme name for the gRPC protocol, so I went with what CoreOS does.

etcd3/client.py Outdated
"""Represents an etcd cluster member."""

creds = None
time_retry = 300.0
Collaborator

This ought to be configurable.

Contributor Author

Agreed, it can be configured simply by setting

etcd3.Node.time_retry = X

Do you have something else in mind?

Collaborator

Well, that's not obvious in the API, since I don't expect users to go and change every Node object stored in the Client object. And changing a class attribute globally is a bad idea.
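
One way to make this configurable per client, sketched here as a suggestion rather than the patch's actual code, is to accept the retry window in the constructor and forward it to each Node:

class Node(object):
    """Represents an etcd cluster member (sketch)."""

    def __init__(self, host, port, secure=False, time_retry=300.0):
        # Per-instance retry window instead of a class attribute, so
        # two clients in the same process can use different values.
        self.time_retry = time_retry


class Etcd3Client(object):
    def __init__(self, host='localhost', port=2379, time_retry=300.0):
        # Forwarded to every Node this client creates.
        self._node = Node(host, port, time_retry=time_retry)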

etcd3/client.py Outdated
for node in available:
# A node might have failed in the meanwhile
if node.failed:
next
Collaborator

I don't think this does what you think it does

>>> next
<built-in function next>

You meant continue, I imagine.

Contributor Author

meh, I fixed this already but didn't commit it.

etcd3/client.py Outdated
else:
available = [m for m in self._nodes_cache.values()
if not m.failed]
for node in available:
Collaborator

No need to build a list in available; just iterate over self._nodes_cache.values() and add an if on L221. This double iteration looks useless.
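
That is, a sketch of the suggested single pass (the loop body is elided):

# One pass over the cache, skipping failed nodes, instead of first
# building the intermediate `available` list.
for node in self._nodes_cache.values():
    if node.failed:
        continue  # a node might have failed in the meanwhile
    ...  # try this node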

Contributor Author

yeah it's a leftover from a previous version of the code

etcd3/client.py Outdated
self._url = '{host}:{port}'.format(host=host, port=port)
def __init__(self, host='localhost', port=2379, srv_record=None,
ca_cert=None, cert_key=None, cert_cert=None,
timeout=None, reconnect=False):
Collaborator

From what I understand, reconnect is a misnomer. It's not reconnection that is handled but jumping from one node to another.

Contributor Author

I agree but I couldn't come up with anything better: if you have a better name, I'm happy to use it!

Collaborator

allow_jump?

etcd3/client.py Outdated
cert_cert_file.read()
)

def _discover(self, srv_name):
Collaborator

It'd make it clearer if you switch this method to be a @staticmethod. Just pass uses_secure_channel as an arg.
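
A sketch of the suggested shape; the SRV lookup via dnspython and the tuples yielded are assumptions about the patch's internals:

import dns.resolver

class Etcd3Client(object):
    @staticmethod
    def _discover(srv_name, uses_secure_channel):
        # No access to self: everything the lookup needs comes in as
        # arguments, which also makes the method easy to test.
        for answer in dns.resolver.query(srv_name, 'SRV'):
            yield (str(answer.target).rstrip('.'), answer.port,
                   uses_secure_channel)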

etcd3/client.py Outdated
self.uses_secure_channel
)
hosts[node.netloc] = node
if not len(hosts):
Collaborator

This would be better raised by the caller of _discover().

etcd3/client.py Outdated
def watchstub(self):
return etcdrpc.WatchStub(self.channel)

def _get_cluster_topology(self):
Collaborator

I see 2 problems with that method:

  1. It's messing with _nodes_cache, and I don't like that. I'd prefer it to return a new list of nodes that can be used by the caller

  2. It's only called at __init__ time. What if the topology changes later?
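
For the first point, a sketch of returning a fresh list instead of mutating the cache (the member accessors and the Endpoint signature are assumptions):

def _get_cluster_topology(self):
    # Build and return fresh nodes; the caller decides what to cache.
    return [Endpoint(member.host, member.port, self.uses_secure_channel)
            for member in self.members]

# e.g. in __init__:
# self._nodes_cache = {n.netloc: n for n in self._get_cluster_topology()}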

Contributor Author

1 - I don't really see why it's a problem; the whole point of that method is to mess with _nodes_cache. But I'm OK with doing it within the caller.

2 - In python-etcd (the v2 client) we ran a similar function every time a connection failed: we would mark the node as failed, retry the call against another node and, if successful, also update the cluster topology to check whether it had changed. I thought it was too much to add that as well in a single patch, and there are some drawbacks to that approach too.

Collaborator

Then just remove that from this patch. You don't need it. We can discuss it in another patch/PR.

@lavagetto
Contributor Author

Thanks for the review @jd, I'll try to fix these up ASAP.

@jordimarinvalle

@lavagetto great work. I was going to develop this myself; I really need it, and this library works with v3. I would like to know when it is going to be merged... I hope soon!

@Gollam
Contributor

Gollam commented Aug 29, 2017

any updates?

@davissp14

davissp14 commented Sep 15, 2017

Ability to fetch the cluster topology on initialization, and store it in a cache. Upon a connection failure, one server is marked as temporarily down and subsequent requests are made to non-failed servers. This allows, for example, a service to keep running without restarts during a maintenance window in which one etcd server is rebooted or its etcd service is restarted.

Many people don't connect directly to their members, but instead front their cluster with gRPC proxies, HAProxy, etc. When this is the case, we wouldn't want to auto-discover the cluster endpoints. I think it may be worth considering simply offering users the ability to define multiple endpoints on initialization and to rotate endpoints on failure. This would offer a generic failover / reconnection scheme that could work across a variety of setups, which is something I think we want in a client library.

@martensson

It's been a while since the last update. What is the current status of supporting SRV and/or multiple endpoints?

Since etcd is a distributed storage, it makes sense for a client library
to be able to connect to all the nodes in the cluster, and to manage
their current status without needing to be reinstantiated.

This patch adds the ability for the client to be aware of the status of
all nodes in the cluster, and - in case of failure - to mark each of
them as temporarily failed.

Individual requests will still raise an exception in case of a failed
connection at this point, but it would be easy to allow retrying if
that seems like a good idea.

What has been done in practice:

* Added an 'Endpoint' class, a simple FSM to abstract a remote server.
* Refactored the initialization of the library, adding a switch to allow
  failing over to another server in case of need.
* Refactored self.channel to be a function returning a grpc channel to
  the first non-failed node, and all the Etcd3Client.*stub properties
  accordingly.
* Refactored the _handle_error wrapper, made it part of the class, and
  split it between the version for generators and the one for normal
  returners for code simplicity
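
The 'Endpoint' FSM described above might look roughly like this; a sketch reconstructed from the snippets quoted in this review, with the state-tracking details assumed:

import time

class Endpoint(object):
    """Represents an etcd cluster member."""

    def __init__(self, host, port, secure=False, time_retry=300.0):
        self.host = host
        self.netloc = "{host}:{port}".format(host=host, port=port)
        self.secure = secure
        self.protocol = 'https' if secure else 'http'
        self.time_retry = time_retry
        self._failed_at = None

    @property
    def failed(self):
        # A failed endpoint becomes eligible again once time_retry
        # seconds have passed.
        if self._failed_at is None:
            return False
        if time.time() - self._failed_at > self.time_retry:
            self._failed_at = None
            return False
        return True

    def fail(self):
        # Mark this endpoint as temporarily down.
        self._failed_at = time.time()
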
@lavagetto
Contributor Author

As requested in the previous review, I factored everything out and made the first patch entirely about connection management:

  • It is possible to define multiple endpoints if we feel like it
  • If failover is allowed, endpoints can be temporarily marked as down and another one is selected.
  • This can be easily extended to use SRV records (it can even be done externally for the time being)

I would love to see this patch merged soon so that I don't have to rebase it again - I don't have much time to dedicate to this at the moment, to be honest.

@lavagetto lavagetto changed the title from "Add SRV record support, reconnection capability" to "Add ability to pass multiple endpoints, failover capability" on Apr 7, 2018
@lavagetto
Contributor Author

It should be noted that the Travis tests fail randomly when trying to create keys via etcdctl, while they consistently succeed locally on my test infrastructure.

@kragniz
Owner

kragniz commented Apr 13, 2018

@jd do you have any time to review this again?

Collaborator

@jd jd left a comment


Looks pretty good, but I'm sad that there are changes that seem to be completely unrelated to the purpose of the patch itself.

self.watcher = self.get_watcher()
self.transactions = Transactions()

@property
Collaborator

Not against this change, but it really doesn't have anything to do with the intention of this patch, right?
It also makes a new object on each access, which has a cost. I'd prefer to see this kind of change in a different PR.

Contributor Author

It does. You need to ensure the various stubs use the correct channel, i.e. the currently active one. If you don't do this, all these properties will hold a reference to the channel that was in use when the object was created. The alternative approach would be to reinstantiate every single one of these objects on every reconnection, which is fine, but it felt less elegant and more error-prone to me.
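
In other words, the stub properties are routed through a channel property that always points at the active endpoint. A sketch of the pattern, with endpoint_in_use.channel assumed as the accessor for the active endpoint's channel:

@property
def channel(self):
    # Always resolves to the channel of the endpoint currently in use,
    # which may change after a failover.
    return self.endpoint_in_use.channel

@property
def kvstub(self):
    # Rebuilt on each access, so it can never hold a stale channel.
    return etcdrpc.KVStub(self.channel)

@property
def watchstub(self):
    return etcdrpc.WatchStub(self.channel)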

try:
return self.endpoint_in_use.use()
except ValueError as e:
if not self.failover:
Collaborator

nitpick:

if self.failover:
    pass
else:
    raise

avoids negation :)

Contributor Author

ack, will fix :)

Contributor Author

Ack


class Etcd3Client(object):
-    def __init__(self, host='localhost', port=2379,
+    def __init__(self, host='localhost', port=2379, endpoints=None,
Collaborator

The type of endpoints here is not documented by any docstring, and looking at the code it seems to be a dict? I'd expect it to be a list. This needs clarification as to why it has to be a dict.

Contributor Author

I was undecided - a list of endpoints is probably the most user-friendly thing to do, yes.
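
A sketch of the list-based signature, with the dict kept as an internal detail (names assumed):

class Etcd3Client(object):
    def __init__(self, host='localhost', port=2379, endpoints=None):
        if endpoints is None:
            endpoints = [Endpoint(host, port)]
        # Accept a user-friendly list; index by netloc internally for
        # de-duplication and O(1) lookup.
        self._nodes_cache = {ep.netloc: ep for ep in endpoints}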

@DanielSiebert

Hi,

What is going on with this patch? Has equivalent functionality been added in the meantime?

Cheers,
Daniel

@chryseosTang

Has the project been completed?

@InvalidInterrupt
Contributor

I've done more work on this feature in my branch: https://github.com/InvalidInterrupt/python-etcd3/tree/reconnections.
It's still lacking documentation and probably could use more tests. Now that this project seems to be active again, what can I do to help get this feature merged?

@kragniz
Owner

kragniz commented May 22, 2021

@InvalidInterrupt open a new PR and tag me in it - I'll set some time aside to review it

@AlexShemeshWix

When is the release? PyPI is still on 0.12.0.
