Long-Range Query Weirdness #396
Comments
Thanks for reporting. Can you try without it? I think this might be a duplicate of #376, but let us know. It seems like downsampling is not working 100%; we need to investigate it more closely.
I have the same problem with the same setup (Prometheus v2.2.1 and Thanos v0.1.0-rc1).
Oh, I misunderstood. There is no Thanos store, so indeed it is not a downsampling issue. Interesting.
I suspect some bug or configuration error in matching the time range and labels, but let's see. As discussed with @pcallewaert, he has different issues (the downsampling one).
The results I filed the bug about happened consistently throughout an afternoon. I just came back from a (very) long weekend and found that the queries that were failing then (including the original timeframe, ending on June 29th) are now fully loading, but if I move the window to look at more recent data, the same problem occurs. It seems to be about crossing some boundary back from the current time. If I ask for a shorter range, as mentioned in the bug report, things are fine until the range crosses some threshold, and then it hits. No external labels have been changed since this was all set up a while back. And, yeah, changing the downsampling setting has no effect; in fact, I'm not even sure it was available in the version I was using when I first noticed this bug.
Interesting characteristics of the downsampling bug found by @asbjxrn:
That will be useful for tracking it down.
I spent some time debugging this today. It appears that the Prometheus XOR encoding scheme used in the sidecar cannot encode/decode more than 2^16 samples. I think what is happening is that the sidecar queries the remote read endpoint without a step size, gets more than 2^16 points back, and that overflows a uint16 counter in the compression. Those points are then sent back to the querier via gRPC with a size indicating only the number of samples left over after the counter overflowed. Any thoughts on what the proper fix is, @bwplotka? Here is a quick test that proves it. Apologies for copying the …
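(The original test is truncated above.) As a rough illustration of the same idea, here is a minimal sketch that assumes the `chunkenc` package from the prometheus/tsdb repository of that era: appending more than 2^16 samples to a single XOR chunk silently wraps the 2-byte sample counter, so the chunk reports far fewer samples than were written.

```go
package main

import (
	"fmt"

	"github.com/prometheus/tsdb/chunkenc"
)

func main() {
	// Build one XOR chunk and append more samples than a uint16 can count.
	c := chunkenc.NewXORChunk()
	app, err := c.Appender()
	if err != nil {
		panic(err)
	}

	const n = 1<<16 + 100 // 100 samples past the uint16 limit
	for i := 0; i < n; i++ {
		app.Append(int64(i), float64(i))
	}

	// The chunk header stores its sample count in two bytes, so the count
	// wraps around: the chunk reports roughly 100 samples instead of 65636.
	fmt.Printf("appended %d samples, chunk reports %d\n", n, c.NumSamples())
}
```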
I think this is the uint16-enforcing line in
It seems like this comment might explain what's going on: https://github.com/improbable-eng/thanos/blob/270d81feec83ba5defa19d64e022d84e3b2e1a2a/pkg/store/prometheus.go#L158-L160. If there are more than 2^16 samples, they might need to be split into multiple chunks.
Wow, huge thanks @robertjsullivan for digging into it so deeply. Once again, I am positively surprised by the community! ❤️ 💪 So, overall, the limit on
You and @kitemongerer are totally right here; I think you spotted a valid bug. We totally missed it, as we were focused on making progress on this: https://docs.google.com/document/d/1JqrU3NjM9HoGLSTPYOvR217f5HBKBiJTqikEB9UiJL0/edit?ts=5c2cdaf0#, which would solve this issue entirely. Since streamed remote read is still in progress, and since we care about backward compatibility, the bug is valid and we should fix the implementation. The fix seems extremely straightforward to me: we should split each series into chunks of at most 2^16 samples and produce the biggest chunks possible, because the comment linked above is still valid.
I will leave the honour of fixing this to you, @robertjsullivan (: If you would prefer me or anyone else to do the fix, that is fine too; let us know.
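For reference, a rough sketch of the splitting approach described above (not the actual PR): cut each series' samples into runs of at most 2^16 samples and encode each run as its own XOR chunk, so no single chunk overflows the uint16 counter. The `sample` type and `encodeChunks` helper are illustrative names, not real Thanos identifiers, and the package path is assumed from the prometheus/tsdb library of that era.

```go
package example

import (
	"math"

	"github.com/prometheus/tsdb/chunkenc"
)

// sample is a hypothetical stand-in for a remote-read sample pair.
type sample struct {
	t int64
	v float64
}

// maxSamplesPerChunk matches the 2-byte sample counter in an XOR chunk.
const maxSamplesPerChunk = math.MaxUint16

// encodeChunks splits samples into runs of at most maxSamplesPerChunk and
// encodes each run as its own XOR chunk.
func encodeChunks(samples []sample) ([]chunkenc.Chunk, error) {
	var chunks []chunkenc.Chunk
	for len(samples) > 0 {
		n := len(samples)
		if n > maxSamplesPerChunk {
			n = maxSamplesPerChunk
		}
		c := chunkenc.NewXORChunk()
		app, err := c.Appender()
		if err != nil {
			return nil, err
		}
		for _, s := range samples[:n] {
			app.Append(s.t, s.v)
		}
		chunks = append(chunks, c)
		samples = samples[n:]
	}
	return chunks, nil
}
```

Producing the biggest chunks allowed keeps the per-chunk overhead low while staying under the counter limit.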
Referenced commit: …2^16 to avoid overflowing XOR compression. Signed-off-by: Robert Sullivan <robert.j.sullivan@gmail.com>
Created the PR. Let me know if anything needs to change. Thanks!
Thanos, Prometheus and Golang version used
What happened
Queried for weeks of data, and found a bunch of it missing:
With a tighter view, Thanos can return the data from the lines on the right, just not with a wide view.
What you expected to happen
Seeing weeks of data (here's one of the underlying Prometheus instances):
How to reproduce it (as minimally and precisely as possible):
Unsure. I've done some experimentation to see exactly where the threshold between "full data" and "partial data" lies.
At a query width (unsure what the term for this is) of:
(I have screenshots for all of the above)
The same problem also happens if you remove the irate() part and just query for the raw data.
Full logs to relevant components
Querier:
Sidecar for Replica A:
The other replica's sidecar has identical logs except for the timestamps and port numbers. (All components are running on the same VM)
Anything else we need to know
The object store isn't in play here; this is just a pair of Prometheus instances with the default 15d retention.
I updated to the latest master today, after first seeing the problem, before starting to gather data for this bug report.