Admin API GET /status "database.reachable" field inaccurately reports Cassandra DB connection health as "true" when Database is offline #7305
Replies: 12 comments
-
Mr @bungle - I've been digging my way through https://github.com/Kong/kong/tree/master/kong/db trying to see just how Kong determines that reachable status. Any pointers? I'm more than willing to try a PR on this.
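From what I can piece together, the check behind that field is essentially a connect-style probe. A rough sketch of the shape (not the actual Kong source; `kong.db.connector:connect()` and the response shape here are my assumptions):

```lua
-- Rough sketch only (not Kong's actual source): a connect-style probe behind
-- the /status "database" field. kong.db.connector:connect() is an assumed API.
local function database_status()
  -- A plain connect() may be satisfied by an already-pooled cosocket, so it
  -- can keep succeeding for a while even after every Cassandra node is gone.
  local ok, err = kong.db.connector:connect()
  if not ok then
    ngx.log(ngx.ERR, "failed to connect to the database: ", err)
  end

  return { reachable = not not ok }
end
```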
-
I mean... Kong obviously notices when nodes become unreachable. It prints to STDOUT that it marks each node as down - there should just be some way to say "if all nodes are down, the database is now unreachable", or, fancier, "if unable to meet the consistency setting, the database is now unreachable".
-
This is where I would start looking: if you turn off your Cassandra and this still works, it is really a bug, so that is what I would look at. Could it be that the connection is not always checked for being a "live connection"? Should that be fixed? Or should the check be more robust - e.g. actually use that connection and see if it works.
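For example, something along these lines - an active probe that actually runs a query instead of only opening a socket. Just a sketch using the lua-cassandra single-host API; the host, keyspace and the choice of `system.local` are illustrative, not what Kong does today:

```lua
local cassandra = require "cassandra"

-- Sketch of an "active" probe: actually use the connection instead of only
-- opening it. Host, keyspace and the system.local read are illustrative.
local function cassandra_is_live(host)
  local peer, err = cassandra.new {
    host     = host or "127.0.0.1",
    port     = 9042,
    keyspace = "kong",
  }
  if not peer then
    return nil, err
  end

  peer:settimeout(1000)  -- keep the probe cheap: 1s timeout

  local ok, cerr = peer:connect()
  if not ok then
    return nil, cerr
  end

  -- system.local is a single-row, node-local table, so this round trip is
  -- inexpensive but still proves the node answers queries.
  local rows, qerr = peer:execute("SELECT release_version FROM system.local")
  peer:setkeepalive()

  if not rows then
    return nil, qerr
  end
  return true
end
```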
-
I'm thinking that it's never checked for a "live connection" right now. The best way to test that may well just be a functional test, but I wouldn't want to make something too expensive on the DB...
-
It seems an actually reliable test of DB health is to hit the root path, which will fail not only when all DB nodes are unreachable, but also when consistency fails - which is the kind of functional test I really like.
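As a quick illustration of that kind of functional check: compare /status with an endpoint that actually reads from the datastore. The Admin API address and the `/services` endpoint here are just assumptions for the example; this uses lua-resty-http, so it needs an OpenResty context (e.g. the `resty` CLI) to run:

```lua
local http = require "resty.http"

-- Illustrative comparison of /status with an endpoint that actually reads
-- from the datastore. Admin API address and endpoint paths are assumptions.
local function check(path)
  local httpc = http.new()
  httpc:set_timeout(2000)

  local res, err = httpc:request_uri("http://127.0.0.1:8001" .. path)
  if not res then
    return "request failed: " .. (err or "?")
  end
  return res.status
end

-- /status may keep answering 200 with database.reachable=true off a pooled
-- connection, while a DB-backed listing starts failing once Cassandra is gone.
print("/status   -> ", check("/status"))
print("/services -> ", check("/services"))
```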
-
So it seems to fail here. Perhaps that is what you could test on.
-
@bungle does Kong have a simple one-entry table in the keyspace, like a Kong metadata table (maybe containing the Kong/OpenResty/nginx version, or something of the sort that's not intense to read), that could be read by a light polling background task of sorts (maybe once every 30 seconds or so, or configured to run as often as the db polling interval set in Kong) to inform that JSON status element? That would fix the bug rather than hack something that just works for us but leaves this field with incorrect behavior. I suppose the band-aid way would be in the ...
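Something like this is what I have in mind - a recurring, cheap probe that caches its result for the /status handler to read. Rough sketch only: the shared dict names, contact points, the `system.local` read and the interval are all assumptions for illustration, and both `lua_shared_dict` zones would have to be declared in the nginx config:

```lua
local Cluster = require "resty.cassandra.cluster"

-- Rough sketch of the polling idea: a recurring timer runs a cheap query and
-- caches the result in a shared dict that the /status handler could consult.
local POLL_INTERVAL = 30  -- seconds; could follow Kong's db polling interval

local function probe(premature)
  if premature then
    return
  end

  local health = ngx.shared.kong_db_health

  local cluster, err = Cluster.new {
    shm            = "cassandra",
    contact_points = { "127.0.0.1" },
    keyspace       = "kong",
  }
  if not cluster then
    health:set("reachable", false)
    return ngx.log(ngx.ERR, "health probe: could not create cluster: ", err)
  end

  -- system.local is a one-row, node-local table, so this read stays cheap
  -- while still proving that a node answers queries.
  local rows, qerr = cluster:execute("SELECT key FROM system.local")
  health:set("reachable", rows ~= nil)
  if not rows then
    ngx.log(ngx.WARN, "health probe query failed: ", qerr)
  end
end

-- started once per worker, e.g. from an init_worker hook
assert(ngx.timer.every(POLL_INTERVAL, probe))
```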
-
This is not a bug, and entirely expected behaviour so far for this property. As per its name, it reports on whether or not Kong can reach the database (i.e. does not encounter any connectivity issues), and does not report on said database's "health". Monitoring the database's health is not one of Kong's responsibilities, and should be delegated to other monitoring tools, especially for a distributed database such as Cassandra where the definition of "healthy" is far from simply being a binary yes/no.

Besides, no single query will determine the health of the Cassandra cluster with any certainty: there are too many variables for each query executed by a C* application (replication factor, consistency settings, built-in retry policies in the driver, etc...), so that if one succeeds, it absolutely does not guarantee that another query with a different partition layout and different CQL settings executed, say, on the proxy path, will actually succeed (e.g. even if some nodes are down, the "health" query could still succeed, while a proxy query may still fail).

Imho, the only thing that could be reported in Kong's monitoring endpoint would be the number of C* nodes that are considered up and/or down by the driver. But again, such nodes could actually be healthy yet unreachable by Kong due to network connectivity issues. In my view, the closest we could get to monitoring C* nodes from Kong would be to track the reachability + the status of each node the driver is keeping track of, i.e. for each node Kong is connected to.

Also, note that currently, DB reachability does not bypass the cosocket connection pool (as noted in the code), which, as far as I can tell and from the top of my head, isn't too much of an issue (if the connection is still open, the DB is still reachable). But the DB reachability test is also susceptible to Kong's DAO's stored connection mechanism, which could be an issue if ...
-
All the same @thibaultcha, shouldn't it be counted as a connectivity issue when a ... Or are we talking more lower-level than that? "Reachable" in the sense that an ICMP ping would succeed to the host? Either way... we currently have a generic ...
-
It should indeed, but since the driver has built-in support for connecting to healthy nodes only, what is needed is work in the C* driver itself to expose a low-level API allowing Kong to expose individual C* node health metrics.
Again, we can expose low-level health metrics from the driver, but expecting a general, binary "healthy"/"not healthy" result isn't realistic imho.
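To make that concrete, the kind of per-node report meant here could look something like the sketch below - assuming the driver exposed an accessor along the lines of the hypothetical `get_peers()` used in it, which is exactly the missing piece; the accessor and the field names are assumptions, not current public lua-cassandra API:

```lua
-- Hypothetical sketch: aggregate per-node status for the /status response,
-- assuming the driver exposed an accessor such as get_peers() returning the
-- nodes it tracks with an "up" flag.
local function cassandra_nodes_report(cluster)
  local peers, err = cluster:get_peers()
  if not peers then
    return nil, err
  end

  local report = { nodes = {}, up = 0, down = 0 }

  for _, peer in ipairs(peers) do
    local state = peer.up and "up" or "down"
    report.nodes[peer.host] = state
    report[state] = report[state] + 1
  end

  -- "up" here only means "up as far as this Kong node's driver knows"; a node
  -- can be healthy yet unreachable from Kong, and vice versa.
  return report
end
```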
-
Is there a strong aversion to having a passive monitor within Kong, as a client to C*/Postgres, that keeps tabs on whether writes or reads are working based on Kong's current configuration? I agree it's not Kong's role to know the state of the C* cluster as a whole. Other keyspaces, other applications etc. could all be using this Cassandra cluster and that's not really in scope. But where I disagree is the idea that there is no value in knowing whether Kong itself is facing problems against a Cassandra cluster (or Postgres node) based on how it's currently configured (consistency settings, etc.).

It's super helpful to know when Kong is failing consistently due to timeouts or is unable to achieve its consistency settings (down C* nodes), maybe due to some network change between Kong and the C* nodes such as firewalls or misconfigured low-level network routers (we also monitor intra-node communication between the C* nodes, so the monitoring we get from Kong gives us insight into network issues between Kong and C*). Right now this is sort of achieved by forcing a call to the admin api ...

This concept seems out of scope from what the Kong team had in mind for the ... I still believe that if, at runtime, we stop all C* node processes so the servers won't be accepting connections on the db port, then I define that as ...
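Concretely, the kind of probe described here would run with the same settings Kong itself uses, something like this sketch - the contact points, keyspace, table and consistency level are placeholders for whatever Kong is actually configured with:

```lua
local Cluster   = require "resty.cassandra.cluster"
local cassandra = require "cassandra"

-- Sketch of a probe run with Kong's own settings, so it fails in the same
-- situations real Kong queries would. All values below are placeholders.
local function probe_with_configured_consistency()
  local cluster, err = Cluster.new {
    shm            = "cassandra",
    contact_points = { "10.0.0.1", "10.0.0.2", "10.0.0.3" },
    keyspace       = "kong",
  }
  if not cluster then
    return nil, err
  end

  -- Reading a real Kong table (rather than system.local) means the query is
  -- subject to the keyspace's replication factor and the consistency below.
  local rows, qerr = cluster:execute("SELECT id FROM routes LIMIT 1", nil, {
    consistency = cassandra.consistencies.local_quorum,
  })
  if not rows then
    return nil, "read probe failed: " .. (qerr or "?")
  end

  return true
end
```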
-
Hi, we are retagging this as a feature request because in this case Cassandra is indeed "reachable" (however the Cassandra driver defines this). It is conceded that "reachable" is not as useful as "usable". So making it a usable flag (maybe a separate one?) implementing what @jeremyjpj0916 suggests, or using some other approach, is a feature request.
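For illustration only, the shape such a response could take if a separate flag were added; the `usable` name and the probe helpers are invented for this sketch, not an agreed design:

```lua
-- Purely hypothetical sketch of the requested feature: keep the existing
-- connection-level flag and add a separate one fed by an active probe.
-- "usable" and the two helper functions are invented names.
local function database_status(reachable_check, usable_check)
  return {
    reachable = reachable_check(),  -- today: can Kong open/reuse a connection?
    usable    = usable_check(),     -- proposed: did a real read/write probe succeed?
  }
end
```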
-
Summary
After Kong has started, if the Cassandra database goes down, the Admin API /status still reports database:reachable=true.
Steps To Reproduce
Even though ...
Additional Details & Logs