gocql does not re-resolve DNS names #831
Any follow-up on this issue? We are having a similar issue when running in Kubernetes services. The IP of the Cassandra pod changes slightly (from 10.48.0.54 to 10.48.0.56) after the pod image is updated to a new version. When this happens, the error is thrown.
Same problem when deploying in Kubernetes.
We were frequently getting this error, but for us it ended up being Cassandra-related, not gocql-related. Our single-instance Cassandra cluster was frequently hitting stop-the-world garbage collection for hundreds of milliseconds at a time. We optimized our application's database operations, tuned the Cassandra JVM settings, changed the GC mechanism (CMS -> G1GC), lengthened the gocql timeouts (600ms -> 5000ms), and gave Cassandra more RAM. The GC stoppage after these changes is much less frequent and much shorter, and we no longer see this gocql error. Not sure if that helps any of you hitting this same error, but you might want to check your Cassandra system.log and/or gc.log to see if no hosts are available because Cassandra is not responding to events.
TL;DR: In K8s (or equivalent), you should be using Pet Sets (or equivalent) with a stable hostname and linked volumes for your stateful services, like Cassandra. Basically this: http://blog.kubernetes.io/2016/07/thousand-instances-of-cassandra-using-kubernetes-pet-set.html

Shifting hosts under DNS is something of an anti-pattern in C* because it relies on concrete, addressable targets in order to gossip about and maintain cluster state. Cassandra is a stateful service and is in constant communication with its peers about their (and their neighbors') individual states. It's unreasonable to expect the state of your local DNS and TTLs to propagate in perfect sync with the state of an arbitrary number of nodes in a distributed system.

Additionally, C* clients attempt to establish connections with all (or many) nodes in the cluster, not just the one(s) you provide in ClusterConfig. This allows clients to make intelligent decisions about query load balancing and cluster availability. Recall that Cassandra has no master nodes, so all nodes are equally available to serve queries. The mantra is "no single point of failure" and, as you've discovered, DNS can be a single point of failure. This problem is not really related to gocql in particular; I believe you'd find the same problem in any of the other stable Cassandra drivers, because of how Cassandra and any client driver is (and must be) designed.

Regarding GC and high load: running a single node of Cassandra doesn't really make sense, but I understand if you're evaluating it from a development standpoint. (Even so, a small 3-node cluster will help you familiarize yourself with consistency levels and replication.) Long GC STW pauses are a strong sign that your "cluster" is overloaded. Tuning Cassandra and the JVM (especially the heap) proportionally to the CPU and memory allocated to your container is usually necessary in any case. Some reading on GC and C* tuning:
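To illustrate the point about contact points, here is a minimal sketch of the usual pattern using gocql's public API (the addresses and keyspace below are placeholders, not values from this thread):

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Several concrete contact points; gocql discovers the rest of the ring
	// from whichever node it reaches first.
	cluster := gocql.NewCluster("10.48.0.10", "10.48.0.11", "10.48.0.12")
	cluster.Keyspace = "example"       // placeholder keyspace
	cluster.Consistency = gocql.Quorum // relies on replication across nodes

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to connect to any contact point: %v", err)
	}
	defer session.Close()
}
```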
@robusto Hi, thanks for your explanation. We use Cassandra in K8s StatefulSets (following the official example), so the client can use a pod DNS name to connect to the servers. But when some pods are deleted (e.g. because of node migration) while the data volumes stay the same, the client still receives errors about the old IPs. Is there any way to solve this problem?
Can you build with the
@Zariel Thank you. Output from
Output from gocql:
And output from C++ client in the same cluster (if it helps):
I reproduced this by following the steps I described in my last comment.
I'm assuming the node
Then it returns with a new Kubernetes-assigned IP address here
The question is, does C* send a
This log suggests C* sends the original IP address in the
Perhaps gocql receives a
@idealhack are you able to reproduce with the
If you are using K8s StatefulSets, then when the pod returns it should have the same IP address as before, and gocql should have no issue reconnecting to the node. Can you confirm the C* pod returns with the same IP address?
@thrawn01 Thank you. I'm convinced that a pod's IP should not change when it crashes and restarts, but it will change when it has been deleted and another pod comes up, which is the situation I reproduced, with
The logs above were written after the deletion and recreation; even more, the clients were also restarted, so I thought the old IPs should never appear in the client logs. As I said, this leads me to one explanation: the old IPs were stored and had been written to disk (and were not removed after the pod was deleted). If so, I wonder if there is a way to avoid this? I will try to get more logs covering when the pods are deleted.
I deleted all pods (without deleting data on most nodes) again, and found that the first pod has those old IPs (I guess it reads them from the previous disk data). And I continued adding new pods. At last the
In the meantime, gocql reports (some redundant parts omitted):
Hmm, it seems something is wrong with those UP nodes, right? I stopped gocql and ran it again, and it reports:
This is more like what
Also, I found kubernetes/kubernetes#49618 is a better way to stop pods. I will change the StatefulSets, clean all the data, and try adding and deleting pods.
I think what's going on here is that your Cassandra cluster has stale nodes in gossip. gocql will get a node-up event, then refresh the ring, which returns the down nodes (the system tables do not include gossip state). The question here is how long gocql should keep trying to connect to downed nodes before they are removed from the local ring. If you do
One issue I can see is that the ring describer won't remove nodes, which is what leads to the first logs you posted https://github.com/gocql/gocql/blob/b96c067a43582b10f95d9e9dabb926483909908a/host_source.go#L663
What issue do you see when the driver is in this state?
Sorry, I'm not that familiar with Cassandra or gocql. I think the main issue is that gocql somehow reports a ring containing some UP nodes which are actually DOWN. As time goes on, the number of such nodes keeps adding up. Eventually:
But according to
Also, I have not tried
When I was posting the first comment I thought these errors might lead to some consistency problems, but it seems consistency is only affected by the consistency level.
Since gocql does not try to reconnect, what is the best way to handle "gocql: no hosts available in the pool"?
@robdefeo We use gocql for our analytics engine at Mailgun. We currently restart the service once every 2 days to ensure the connection pool is full, and we monitor the size of the pool by emitting metrics on the connection pool size (we modified gocql to achieve this). This is a temporary solution until we have sufficient time to formulate a full patch to gocql. If I don't find time to work on a patch this quarter I'll be very unhappy. This has been a major pain point for us.
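In that same spirit, one possible client-side workaround is a watchdog that periodically issues a cheap query and rebuilds the session when the pool is exhausted. This is only a sketch, not part of gocql; the probe interval, keyspace, host name, and error-string match are assumptions:

```go
package main

import (
	"log"
	"strings"
	"time"

	"github.com/gocql/gocql"
)

// newSession creates a fresh session from scratch; hosts is reused as-is here,
// but it could be re-resolved from a DNS name before each attempt.
func newSession(hosts []string) (*gocql.Session, error) {
	cluster := gocql.NewCluster(hosts...)
	cluster.Keyspace = "example" // placeholder keyspace
	return cluster.CreateSession()
}

func main() {
	hosts := []string{"cassandra.example.svc.cluster.local"} // placeholder host
	session, err := newSession(hosts)
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(30 * time.Second) {
		// Lightweight probe; any cheap query works.
		err := session.Query("SELECT release_version FROM system.local").Exec()
		if err != nil && strings.Contains(err.Error(), "no hosts available") {
			log.Printf("pool exhausted, rebuilding session: %v", err)
			session.Close()
			if s, err := newSession(hosts); err == nil {
				session = s
			}
		}
	}
}
```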
Hi folks, any updates on this story? We recently had a big outage that seems to be partially related to this error. I'm testing it locally and can see this error message showing. Basically, what I tried was to pause the Cassandra Docker container and restart it (to mimic the whole Cassandra cluster going down). gocql complains
I am using gocql in my Kubernetes cluster with a 3-node Cassandra setup. It works fine. However, if I want to test locally on my machine, I usually use kubectl port-forward to be able to connect to the Cassandra cluster: kubectl port-forward --namespace cassandra service/cassandra 9042:9042. gocql seems to have a problem with that, as it discovers the cluster but apparently wants to connect to the nodes directly:
10.42.96.11 is the pod IP inside the cluster, but obviously this is not reachable locally on my machine; only localhost:9042 is. The weird thing is that after ~5 seconds of trying, my application starts, and I can query my Cassandra cluster. I tried setting: cluster.DisableInitialHostLookup = true
cluster.IgnorePeerAddr = true
That didn't help, though. Also, after another 20-30 seconds, the node seems to be flapping up and down again:
Is there anything I can do to prevent this? Locally, it's totally fine if gocql only connects to a single node; it's just for development purposes, and as I said, gocql works perfectly fine when deployed in the production Kubernetes cluster.
@steebchen I've not tested your setup; I only use a single node locally for development. However, disabling the initial host lookup will only keep gocql from asking the control node (the first connection) about other nodes in the cluster. It will not keep the control node from telling the client (gocql) about changes to the status of other nodes in the cluster.
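For local development against a single port-forwarded node, a minimal sketch of the configuration being discussed, assuming the DisableInitialHostLookup and IgnorePeerAddr fields behave as described above (the keyspace is a placeholder):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Only localhost is reachable through the port-forward, so skip the
	// initial discovery of peer addresses that are private to the cluster.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Port = 9042
	cluster.Keyspace = "example" // placeholder keyspace
	cluster.DisableInitialHostLookup = true
	cluster.IgnorePeerAddr = true
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect via port-forward failed: %v", err)
	}
	defer session.Close()
}
```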
We also ran into a similar issue recently with Cassandra deployed as a StatefulSet in a Kubernetes cluster.

A few details about our setup: on the Kubernetes cluster, Cassandra is exposed as a Kubernetes service. Go clients then connect to the Cassandra cluster through this Kubernetes service name, which is essentially a DNS name for the IP address of the running container pod.

Now about the issue: in the above scenario, a Cassandra pod instance will come up with a different IP address but under the same DNS name. However, looking at the gocql documentation, it looks like there is an assumption that users need to pass only IP addresses and not DNS names, which is really not possible in such setups, because the moment a Cassandra node goes down and is restarted, it comes up with a different IP address but the same DNS name. Could it be that the gocql driver is unable to re-establish the connection because it is still trying to do so against the old IP address? I feel such an assumption is not apt, because if the driver is supplied with a DNS name, it should use that name when trying to reconnect. This has already been resolved in the official Java Cassandra driver by the Datastax team. Here is the ticket for your reference.

So, would you please prioritize this ticket and help with the necessary corrections?
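Until the driver re-resolves names itself, one possible client-side workaround (a sketch only, under the assumption that re-resolution in application code is acceptable; the service name and keyspace are placeholders) is to look up the DNS name immediately before every session creation, so a rebuilt session picks up the pod's current IP:

```go
package main

import (
	"log"
	"net"

	"github.com/gocql/gocql"
)

// sessionFromDNS resolves the service name at call time, so a rebuilt session
// picks up whatever IP address the Cassandra pod currently has.
func sessionFromDNS(name string) (*gocql.Session, error) {
	addrs, err := net.LookupHost(name)
	if err != nil {
		return nil, err
	}
	cluster := gocql.NewCluster(addrs...)
	cluster.Keyspace = "example" // placeholder keyspace
	return cluster.CreateSession()
}

func main() {
	session, err := sessionFromDNS("cassandra.default.svc.cluster.local")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```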
Okay, I'll try to have a look. Considering we're beginning to work with k8s as well, this may end up being handy, and sooner than expected.
@alourie thank you for looking at it! Is there any timeline by which we could expect a resolution? Unfortunately, it's a critical need for us.
@sanjimoh Sorry, it would be hard to put a timeline on it. I'm finishing up something else first, then will get to it, probably mid-next week. From there it could take some time until I figure this out. As I said, we need it too, so I won't delay this too much.
Hi, did you get a chance to check this now?
Not yet, but planning to do it next week.
I have some personal circumstances that won't allow me to look at this for a while. Sorry about that.
Hi, we are facing the same problem described by @sanjimoh. Is there any resolution for this yet?
This happened today on my local machine where I forgot to close
@alourie: Could this be worked on now? If not you, could anyone else from the library maintainers pick it up?
Of the people who have developed their own techniques to work around this, which of the two apparent strategies are you using:
? Are there additional strategies?
This is related to #1575, particularly #1575 (comment)
I faced this problem too. I solved it by configuring a ConnectObserver on the ClusterConfig to listen to the connection state; when it reports an issue, I recreate the session.
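A minimal sketch of that approach, assuming gocql's ConnectObserver interface with an ObserveConnect(ObservedConnect) method and an Err field on ObservedConnect (the host name is a placeholder, and the session-recreation logic itself is left as an application-specific placeholder):

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

// reconnectObserver signals the application when a connection attempt fails,
// so it can decide to tear down and recreate the session.
type reconnectObserver struct {
	failures chan error
}

func (o *reconnectObserver) ObserveConnect(c gocql.ObservedConnect) {
	if c.Err != nil {
		select {
		case o.failures <- c.Err:
		default: // drop the signal if nobody is listening yet
		}
	}
}

func main() {
	obs := &reconnectObserver{failures: make(chan error, 1)}

	cluster := gocql.NewCluster("cassandra.default.svc.cluster.local") // placeholder
	cluster.ConnectObserver = obs

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	go func() {
		for err := range obs.failures {
			// Application-specific: close the old session and build a new one.
			log.Printf("connect failure observed, consider recreating the session: %v", err)
		}
	}()

	// ... normal application queries against `session` go here ...
}
```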
I think this issue can now be closed since b9737dd was merged.
Using the gocql library from about two weeks ago, I noticed the following issue (we had been using a gocql version from last April and see the same issue, even though your code has changed in this area; you seemed to fix a bug in ring.go). Our application runs in a cloud environment where Cassandra instances can move around from node to node (IP addresses will be different), so we use DNS to manage this. In this use case, we have a single-node Cassandra cluster, and at application startup we pass this name to our application (which is passed to the gocql Session abstraction). All works well until the Cassandra node is restarted, which means it starts up bound to a new IP. The control connection in gocql fails as it notices the connection to the old IP has been closed. At this point, the only answer is to restart our application, because gocql has no way to know the new IP of the Cassandra node. The issue seems to be that gocql loses the information regarding the DNS name we passed in. I'm new to gocql, but can't find a way (via some config setting) to address this issue in our application. Any help would be appreciated.
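For context, a minimal sketch of the setup described above (the DNS name and keyspace are placeholders): the name is handed to the driver once at startup, resolved when the session is created, and, as reported in this issue, there is then nothing for the driver to re-resolve once that IP goes away.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// The DNS name is handed to the driver once, at startup.
	cluster := gocql.NewCluster("cassandra.service.internal") // placeholder DNS name
	cluster.Keyspace = "example"                              // placeholder keyspace

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// If the single Cassandra node later restarts on a new IP, the control
	// connection fails and, per this issue, the driver does not retain the
	// original name to resolve again.
}
```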