Our Pinot cluster is not highly available #12888
@piby180 We also have a similar configuration deployed in prod. It would be great if you could post your observations here. A couple of questions:
Hi @vineethvp
On Slack, I see quite a few users experiencing this issue. The core issue is that a server/broker is "marked" ready by Pinot before it is "marked" ready by Kubernetes, because the readiness probe hasn't succeeded yet. As a result, Pinot starts sending traffic to the server/broker, but the server's DNS endpoint does not resolve properly. Pinot needs to wait until the server/broker DNS endpoint is working again.
Can you describe the Pinot server statefulset and check the liveness probe? Ideally the Pinot server should be marked down first, then the broker won't route requests to it, and then k8s can shut it down.
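For reference, the probe section being discussed might look like the fragment below. This is only a sketch: the `/health` path and port `8097` are assumptions based on the Pinot server's default admin endpoint, not taken from the cluster in question.

```yaml
# Hypothetical probe fragment for a Pinot server statefulset.
# The /health path and admin port 8097 are assumptions; adjust
# to match your actual server admin endpoint.
livenessProbe:
  httpGet:
    path: /health
    port: 8097
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8097
  initialDelaySeconds: 60
  periodSeconds: 10
```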
I see this error after the server pod has finished restarting and has healthy logs, but the status of the pod is still not Ready. When the status of the pod is Ready, the error disappears. Here is the yaml for our server statefulset:
Looks like I finally hit the jackpot on this issue after 2 months. I added the following to the pinot-server-headless and pinot-broker-headless services, and the error seems to disappear.
Setting this to true means k8s will allow DNS lookups irrespective of whether the server pod is ready or not. Whether to query a server or not now depends solely on Pinot. From the Kubernetes docs:
https://kubernetes.io/docs/reference/kubernetes-api/service-resources/service-v1/#ServiceSpec @vineethvp Could you test on your end whether adding this fixes the issue?
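Based on the description above (DNS lookups allowed regardless of pod readiness) and the linked ServiceSpec docs, the setting in question is presumably `publishNotReadyAddresses`. A minimal sketch of a headless service with it enabled; the service name, selector, and port are illustrative, only the `publishNotReadyAddresses` field is the point here:

```yaml
# Hypothetical headless-service fragment, not the thread's actual yaml.
apiVersion: v1
kind: Service
metadata:
  name: pinot-server-headless
spec:
  clusterIP: None  # headless service
  # Publish DNS records for pods even before they pass readiness,
  # so Pinot (not k8s) decides whether a server is queryable.
  publishNotReadyAddresses: true
  selector:
    app: pinot-server
  ports:
    - name: netty
      port: 8098
```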
cc @zhtaoxiang ^^
Hey all!
We are facing a serious high-availability issue despite replication across all components. When one out of three broker/server pods is unavailable, some of our queries fail with an error.
So there is downtime whenever a server or broker pod restarts.
Cluster and Table Setup
Some context about our cluster configuration:
Here is our values.yaml
We query Pinot in three ways:
Our standard offline tables have the following config
Problem
When a broker or server pod gets restarted during a cluster update, or when our ops team makes changes to the kubernetes cluster, some of our queries fail.
With multistage disabled: It seems the queries are routed in round-robin fashion. If you retry the same query 5 times, it will fail 1-2 times, when it reaches the server pod or broker pod that is restarting. The other 3-4 times, it reaches the healthy broker/server pods and returns a result.
With multistage enabled: The queries almost always fail when one of the broker or server pods is restarting. It seems the queries are fanned out to all servers.
Disabling multistage is not an option for us since we are using joins in some queries.
The error log we get when, for example, server-2 is restarting:
Full query response in Pinot UI for a successful query with multistage enabled (here you can see the query is being routed to multiple servers):
SELECT COUNT(*) FROM pinot_metadata_feeds;
Full query response in Pinot UI for a successful query with multistage disabled:
SELECT COUNT(*) FROM pinot_metadata_feeds;
Expectation
Since we have 3 replicas of every segment and 3 replicas of every component, Pinot must only route queries to healthy broker/server pods, and a query must not fail when 1 out of 3 server/broker pods is unavailable.
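For context, segment replication in a Pinot offline table is configured under `segmentsConfig` in the table config. The fragment below is an illustrative sketch of a replication factor of 3 (the table name mirrors the query above, but this is not the cluster's actual config):

```json
{
  "tableName": "pinot_metadata_feeds_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": "3"
  }
}
```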
Solutions tried so far
Added the following to broker-conf
As of now, Pinot for us is not highly available despite following all best practices regarding replication. This scares us, as we could face downtime any day if kubernetes randomly restarts a pod, which is not uncommon.
This issue is quite important for us, as it was our base assumption that Pinot is highly available.
Any help is much appreciated!
Thanks!
FYI: This issue was first discussed on Slack here