-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
receive: Incorrect error status code with multiple AlreadyExists #5407
Comments
I think I have found the origin of this issue. As the comment in This error is flattened with a comment explaining why it is. Lines 510 to 513 in 22aefd6
In the case where replicate is called in fan out it is not flattened, meaning that there is a risk that a multi error is placed inside of a multi error. Lines 475 to 490 in 22aefd6
|
I really think that there is something weird going on when the replication factor is greater than one. Decreasing the replication factor temporarily gets all Prometheus agents out of their bad loop, and then it can be increased again. When decreasing the replication factor it starts returning 409 errors instead which gets them to backoff and decrease their shards. The issue with testing this is that it is difficult to reproduce the error properly. |
I think we might be able to catch this with a good enough test. For clarification, do you only see this when running separate receivers and ingesters? You could run the same multi-tenant setup without having a split. |
This can also be a bug in how we verify quorum during replication: The actual formula should be
So with replication factor of 2, it should be sufficient for just one write to succeed. That is also what is described in the design doc: https://thanos.io/tip/proposals-done/201812-thanos-remote-receive.md/ |
It might take some time to setup a new environment without the ingestor split. I will try to get some time to write a unit test to reproduce this. |
Please ask if you need further information, we're experiencing this issue in a routing receive setup and obviously use multiple receive for scaling reasons, which makes applying the workaround quite hard. |
@Tontaube are you seeing the same issue with a replication factor of 1? I think that is the interesting detail, which will help confirm the idea that the issue is related to the replication factor logic. |
Can confirm, we see the same issue with a replication factor of 1 |
I cannot tell, as our environment recovered due to solved other issues. We saw the problem always then, when data came in more than 2 hours too late. Reducing the receive instances was not tried, on one hand due to the fact it being a gitops only environment (no manual adaptions in prod) and the assumption, that one receiver would not be capable to handle the load, anyway. |
Hey @phillebaba, I revisited this and I think you're totally right, we are unable to properly determine error cause if there is a multi-error within a multi-error. This can happen, especially with replication factor 2, which you mention is your case. Looking at this portion of the method: Lines 1126 to 1132 in 9237dd9
we see that the method returns a "cause" error (line 1129) if threshold is reached or it returns the original error (line 1132). What I'm suspecting is that setting threshold to write quorum can cause this unwanted multi-error nesting. I considered two different scenarios:
This might explain why this issue is not very prevalent with (I believe) more commonly used quorums like 3 or 5, where we usually reach the threshold. However, with quorum 2, if 1 request fails, this automatically means we always nest multi-error = we always return I hope this is making sense. If these assumptions are correct, I see two ways of fixing this: |
I can absolutely believe that there are some subtle issues in receive error handling. We've brought up discussions around this in the past and I fear that we haven't been completely rigorous in our assessment of receive errors :/// just to add some historical context, the behavior of error handling was changed when we changed multi-error libraries and how we flattened/nested them. this behavior entirely changed how errors are counted and thus how quorum is validated. This challenge has been further highlighted in the past by speed optimizations to the error handling path that may have simplified error handling for speed reasons but maybe did not account for all of the edge-cases. That's not to say that the goals of correctness and performance are incompatible. @matej-g I think your explanation sounds very very feasible however I'm worried that the we'll continue to play whackamole with receiver error issues if we aren't completely rigorous and exhaustive in how we analyze all of the potential error scenarios. I guess overall I'm mostly worried that we will implement heuristics for simplicity instead of being completely exhaustive.
I'm somewhat confused by this; could you clarify it? The most frequent error cause isn't necessarily the failure we care about bubbling up to the top and thus not necessarily what should dictate the final error code, right? I think we would be very well served to add explicit test cases for the scenarios you highlighted in handler_test.go so we can document the correct behavior in tests. |
Anyone working on this? |
Any resolution for this issue, We are getting this issue in production
|
Thanos, Prometheus and Golang version used:
Thanos: quay.io/thanos/thanos:v0.26.0
Prometheus: quay.io/prometheus/prometheus:2.36.0
Object Storage Provider:
Azure
What happened:
I am running a multi-tenant router-receiver setup with multiple Prometheus instances remote writing to it, and a replication factor of 2. At times an error occurs where remote writing fails, and Prometheus will have to resend time series. However when this occurs some time series may already be present in the receiver store and return a
AlreadyExists
error. This does not seem to be an issue when a single receiver already has the time series, but a bigger issue when multiple receivers returns the same error. The Prometheus instance will receive a 500 error and try again, forcing it to end up in an infinite loop where it will never be able to recover from.What you expected to happen:
I expect the correct error to be returned for this condition so that the Prometheus instance will stop sending the same time series which already exist in the receiver store.
How to reproduce it (as minimally and precisely as possible):
This error will only occur when receivers are being torn down and setup again one by one. While this is being done some replications may fail causing the Prometheus instance to resend the same data. It is hard to pin point more conditions to get this error to happen, as sometimes everything works out without any issues. When the error occurs every single Prometheus instance will fail to write as they will continue to attempt sending the same metrics over and over again.
Full logs to relevant components:
Anything else we need to know:
I have done some analysis of what is occurring and my guess is that the
determineWriteErrorCause
function has a bug in it where it does not detect the conflict error and return it.thanos/pkg/receive/handler.go
Line 730 in 17c5764
Looking at the handler code it is apparent that if a wrapped error is returned a 500 status code will be given instead of a 409 which is the correct one for conflict errors.
thanos/pkg/receive/handler.go
Lines 361 to 375 in 17c5764
Reading through the replication code it seems like the original
AlreadyExists
error is being wrapped two or three times into other errors. When this is done the function to determine if the error is a conflict will not work and instead it will return the whole error causing a 500 status.thanos/pkg/receive/handler.go
Line 484 in 22aefd6
thanos/pkg/receive/handler.go
Line 582 in 22aefd6
thanos/pkg/receive/handler.go
Line 665 in 22aefd6
We need to write a unit test specifically replicating the
AlreadyExists
error to verify thatdetermineWriteErrorCause
will return a conflict error.The text was updated successfully, but these errors were encountered: