Add ability to pass multiple endpoints, failover capability #106
kragniz merged 2 commits into kragniz:master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #106      +/-   ##
==========================================
+ Coverage   92.86%   93.13%   +0.26%
==========================================
  Files          10       10
  Lines         617      714      +97
==========================================
+ Hits          573      665      +92
- Misses         44       49       +5
Continue to review full report at Codecov.
Hey, I've only skimmed over your patch, but it looks great so far!
@kragniz as expected, I had a little time to work on this on Sunday, and I found a few small issues with my code that I am still fixing. It might take another week or two (next Sunday is Easter here).
I'll look into improving the test coverage (I think the whole SRV DNS part is still untested), but I consider the patch to be in a reviewable state now.
Does this cover #150 also?
No, it currently doesn't, but:
1. it's trivial to support
2. if you set reconnect to True, the library will autodiscover active members
G.
On Fri, 16 Jun 2017 at 10:32, Julien Danjou <notifications@github.com> wrote:
… Does this cover #150 <#150> also?
I'll try to review this patch ASAP.
I think this patch is ready to be reviewed/merged, but please let me know if you want me to improve something.
jd left a comment
Here's a bunch of review. I feel bad that this is only one patch when there are actually three different pieces:
- Handling jumping between multiple servers
- Doing SRV requests to gather the initial server(s)
- Reading the cluster topology to jump between servers
It'd have been much better to do three different incremental patches.
Thank you for doing that anyway! That looks promising.
self.host = host
self.netloc = "{host}:{port}".format(host=host, port=port)
self.secure = secure
self.protocol = 'https' if secure else 'http'
why http/https? there's no http to me here.
That's how etcd itself refers to its endpoints in its configuration. I checked, and in fact there is still no official URI scheme designation for the gRPC protocol, so I went with what CoreOS does.
etcd3/client.py
Outdated
| """Represents an etcd cluster member.""" | ||
|
|
||
| creds = None | ||
| time_retry = 300.0 |
This ought to be configurable.
Agreed, it can be configured simply by setting etcd3.Node.time_retry = X. Do you have something else in mind?
Well, that's not obvious in the API, since I don't expect users to go change every Node object stored in the Client object. And changing a class attribute globally is a bad idea.
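A minimal sketch of the per-instance alternative being hinted at here (the constructor parameter and its name are assumptions for illustration, not the patch's actual API):

class Node(object):
    def __init__(self, host, port, secure, time_retry=300.0):
        self.host = host
        self.port = port
        self.secure = secure
        # Per-instance value, overridable at construction time instead of
        # by mutating the class attribute for every user of the library.
        self.time_retry = time_retry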
etcd3/client.py
Outdated
for node in available:
    # A node might have failed in the meanwhile
    if node.failed:
        next
I don't think this does what you think it does:
>>> next
<built-in function next>
You meant continue, I imagine.
meh, I fixed this already but didn't commit it.
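To make the difference concrete, a tiny self-contained demo (FakeNode is invented for the example): a bare next is an expression that evaluates the builtin and falls through, while continue actually skips the iteration.

class FakeNode(object):
    def __init__(self, failed):
        self.failed = failed

nodes = [FakeNode(False), FakeNode(True), FakeNode(False)]
for node in nodes:
    # A node might have failed in the meanwhile
    if node.failed:
        continue  # a bare `next` here would do nothing and fall through
    print("using node", node)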
etcd3/client.py
Outdated
else:
    available = [m for m in self._nodes_cache.values()
                 if not m.failed]
    for node in available:
No need to build a list in available; just iterate over self._nodes_cache.values() and add an if on L221. This double iteration looks useless.
Yeah, it's a leftover from a previous version of the code.
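A sketch of the single-pass shape suggested above (the method name is hypothetical; _nodes_cache and failed come from the diff):

def _first_usable_node(self):
    """Single pass over the cache; no intermediate `available` list."""
    for node in self._nodes_cache.values():
        if node.failed:
            continue
        return node
    raise ValueError("no usable etcd node")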
etcd3/client.py
Outdated
self._url = '{host}:{port}'.format(host=host, port=port)

def __init__(self, host='localhost', port=2379, srv_record=None,
             ca_cert=None, cert_key=None, cert_cert=None,
             timeout=None, reconnect=False):
From what I understand, reconnect is a misnomer. It's not reconnection that is handled but jumping from one node to another.
I agree but I couldn't come up with anything better: if you have a better name, I'm happy to use it!
etcd3/client.py
Outdated
cert_cert_file.read()
)

def _discover(self, srv_name):
It'd make it clearer if you switch this method to be a @staticmethod. Just pass uses_secure_channel as an arg.
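A sketch of the suggested @staticmethod shape, assuming dnspython performs the SRV lookup (the tuple return type is illustrative; the patch builds node objects instead):

import dns.resolver  # dnspython, assumed to be the resolver in use

class Etcd3Client(object):
    @staticmethod
    def _discover(srv_name, uses_secure_channel):
        """Resolve a DNS SRV record into (host, port, secure) tuples."""
        answers = dns.resolver.resolve(srv_name, 'SRV')  # `.query` on dnspython < 2.0
        return [(str(rr.target).rstrip('.'), rr.port, uses_secure_channel)
                for rr in answers]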
etcd3/client.py
Outdated
    self.uses_secure_channel
)
hosts[node.netloc] = node
if not len(hosts):
This'd better be raised by the caller of _discover().
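At the call site that would look roughly like this (a sketch; the exception type and message are assumptions):

# In __init__, after discovery:
nodes = self._discover(srv_record, self.uses_secure_channel)
if not nodes:
    raise ValueError("SRV record %r did not yield any hosts" % srv_record)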
etcd3/client.py
Outdated
def watchstub(self):
    return etcdrpc.WatchStub(self.channel)

def _get_cluster_topology(self):
I see two problems with that method:
- It's messing with _nodes_cache, and I don't like that. I'd prefer it to return a new list of nodes that can be used by the caller.
- It's only called at __init__ time. What if the topology changes later?
1. I don't really see why it's a problem (the whole point of that method is to mess with _nodes_cache), but I'm ok with doing it within the caller.
2. In python-etcd (the v2 client) we ran a similar function every time a connection failed: we would mark the node as failed, retry the call against another node, and, if successful, also update the cluster topology to check whether it had changed. I thought it was too much to add that as well in a single patch, and there are some drawbacks to that approach too.
Then just remove that from this patch. You don't need it. We can discuss this in another patch/PR.
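The non-mutating shape discussed above might look like this (a sketch; member_list and the Endpoint constructor are assumptions, not the patch's API):

def _get_cluster_topology(self):
    """Return a fresh list of endpoints; the caller decides whether to
    replace _nodes_cache with it, so this method mutates nothing."""
    members = self.member_list()  # hypothetical accessor for etcd's Members API
    return [Endpoint(m.host, m.port, self.uses_secure_channel)
            for m in members]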
Thanks for the review, @jd. I'll try to fix these up ASAP.
@lavagetto great work. I was about to develop this myself; I really need it, and this library works with v3. I would like to know when it is going to be merged... I hope soon!
Any updates?
Many people don't directly connect to their members, but instead front their cluster with gRPC proxy(s), HAProxy(s), etc. When this is the case, we wouldn't want to auto-discover the cluster endpoints. I think it may be worth considering simply offering users the ability to define multiple endpoints on initialization and rotate endpoints on failure. This would offer a generic failover / reconnection scheme that could work across a variety of setups, which is something I think we want in a client library.
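A minimal sketch of that rotation scheme (the class and method names are invented for illustration):

import itertools

class EndpointRotator(object):
    """Rotate through a user-supplied endpoint list on failure, with no
    auto-discovery, so it also works behind gRPC proxies or HAProxy."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)
        self.current = next(self._cycle)

    def fail_over(self):
        """Advance to the next endpoint after the current one fails."""
        self.current = next(self._cycle)
        return self.current

A client holding such a rotator would call fail_over() from its error handler and rebuild its channel against rotator.current.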
It's been a while since the last update; what is the current status of SRV and/or multiple endpoint support?
Since etcd is a distributed storage, it makes sense for a client library to be able to connect to all the nodes in the cluster and to manage their current status without needing to be reinstantiated. This patch adds the ability for the client to be aware of the status of all nodes in the cluster and, in case of failure, to mark each of them as temporarily failed. Individual requests will still raise an exception on a failed connection at this point, but it would be easy to allow retrying if that seems like a good idea. What has been done in practice:
* Added an 'Endpoint' class, a simple FSM to abstract a remote server.
* Refactored the initialization of the library, adding a switch to allow failing over to another server in case of need.
* Refactored self.channel to be a function returning a grpc channel to the first non-failed node, and all the Etcd3Client.*stub properties accordingly.
* Refactored the _handle_error wrapper, made it part of the class, and split it between the version for generators and the one for normal functions, for code simplicity.
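As an illustration of the description above, a sketch of what such an Endpoint FSM could look like, reconstructed from the diff fragments quoted in this review (fail() and the exact timing behaviour are assumptions):

import time

class Endpoint(object):
    """Minimal FSM for one server: usable until fail() is called, then
    considered failed until time_retry seconds have elapsed."""

    time_retry = 300.0

    def __init__(self, host, port, secure):
        self.host = host
        self.port = port
        self.netloc = "{host}:{port}".format(host=host, port=port)
        self.protocol = 'https' if secure else 'http'
        self._failed_at = None

    def fail(self):
        self._failed_at = time.time()

    @property
    def failed(self):
        if self._failed_at is None:
            return False
        if time.time() - self._failed_at > self.time_retry:
            self._failed_at = None  # retry window elapsed; usable again
            return False
        return True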
As requested in the past review, I factored everything out and made this first patch completely about connection management.
I would love to see this patch merged soon so that I don't have to rebase it again; I don't have much time to dedicate to this at the moment, to be honest.
It should be noted that the Travis tests fail randomly when trying to create keys via etcdctl, while they consistently succeed locally on my test infrastructure.
@jd do you have any time to review this again?
jd left a comment
Looks pretty good, but I'm sad that there are changes that seem to be completely unrelated to the purpose of the patch itself.
self.watcher = self.get_watcher()
self.transactions = Transactions()

@property
Not against this change, but it really does not have anything to do with the intention of this patch, right?
It also makes a new object each time, which has a cost. I'd prefer to see this kind of change in a different PR.
It does. You need to ensure the various stubs use the correct channel, which is the currently active one. If you don't do this, all these properties will hold a reference to the channel in use at the time you created the object. The alternative approach would be to reinstantiate every single one of these objects on every reconnection, which would work, but it felt less elegant and more error-prone to me.
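One possible middle ground for the construction cost raised above (a sketch; it assumes self.channel resolves to the currently active channel, that self._stub_cache is initialized to an empty dict in __init__, and the etcdrpc module from the surrounding diff):

def _stub(self, factory):
    """Cache stubs per channel: cheap on repeated access, yet rebound
    automatically once a failover produces a new channel."""
    channel = self.channel
    cached = self._stub_cache.get(factory)
    if cached is None or cached[0] is not channel:
        cached = (channel, factory(channel))
        self._stub_cache[factory] = cached
    return cached[1]

@property
def watchstub(self):
    return self._stub(etcdrpc.WatchStub)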
try:
    return self.endpoint_in_use.use()
except ValueError as e:
    if not self.failover:
nitpick:
if self.failover:
    pass
else:
    raise
avoids negation :)
class Etcd3Client(object):
-    def __init__(self, host='localhost', port=2379,
+    def __init__(self, host='localhost', port=2379, endpoints=None,
The type of endpoints here is not documented by any docstring, and looking at the code it seems to be a dict? I'd expect it to be a list. That needs clarification as to why it'd need to be a dict.
I was undecided; probably a list of endpoints is the most user-friendly thing to do, yes.
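A sketch of accepting the user-friendly list form while keying by netloc internally (the secure parameter and the Endpoint class are assumptions carried over from earlier sketches):

def __init__(self, host='localhost', port=2379, endpoints=None,
             secure=False):
    """`endpoints` is a list of 'host:port' strings."""
    if endpoints is None:
        endpoints = ['{0}:{1}'.format(host, port)]
    self._endpoints = {}
    for netloc in endpoints:
        ep_host, _, ep_port = netloc.partition(':')
        self._endpoints[netloc] = Endpoint(ep_host, int(ep_port or port),
                                           secure)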
Hi, what is going on with this patch? Has equivalent functionality been added in the meantime? Cheers,
Has the project been completed?
I've done more work on this feature in my branch: https://github.com/InvalidInterrupt/python-etcd3/tree/reconnections.
@InvalidInterrupt open a new PR and tag me in it - I'll set some time aside to review it.
When is the release? PyPI is still on 0.12.0.
These two commits add two features:
This is still a WIP (as you might have noticed, there are no docs and no additional tests; sorry, but I really can only work on this on Sundays :/), but I wanted to get some feedback on the direction I've taken with the code, given it is a non-negligible overhaul.
The way I built the reconnection logic might seem a bit of overkill, but my experience trying to build a simpler model for the etcd2 Python library taught me otherwise. I tried to learn from that experience.
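For reference, intended usage as described in this PR would look roughly like this (a sketch: reconnect is the switch added by this patch, not part of the released library, and it assumes the module-level etcd3.client() factory forwards the new keyword):

import etcd3

# On a connection failure the client marks the endpoint as failed and
# retries subsequent calls against the next non-failed cluster member.
client = etcd3.client(host='etcd-a.example.com', port=2379, reconnect=True)
client.put('foo', 'bar')
print(client.get('foo'))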