-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix bugs in logs results caching and its tests #7925
fix bugs in logs results caching and its tests #7925
Conversation
if !lokiReq.GetStartTs().After(cachedRequest.GetStartTs()) && lokiReq.GetStartTs().Equal(cachedRequest.GetStartTs()) && | ||
!lokiReq.GetEndTs().Before(cachedRequest.GetEndTs()) && lokiReq.GetEndTs().Equal(cachedRequest.GetEndTs()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check looked wrong to me, so I fixed it as well.
* filter out response from extended queries we make for filling the cache gap * fix response building and comparison in tests
33451cc
to
a7cbe87
Compare
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0.6%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
}, | ||
} | ||
for _, stream := range r.Data.Result { | ||
if stream.Entries[0].Timestamp.After(end) || stream.Entries[len(stream.Entries)-1].Timestamp.Before(start) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition one above uses ts.Nano()
where as here we use time.After
for comparison. Can we make both same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with different ways to compare the time for convenience and readability.
It is hard to do <=
or >=
with time.After
and time.Before
while keeping it simple to read.
If I do the same here, then it would make the code a little inefficient due to having to do the conversion for each log line. I think it would be good to keep it as is, but please feel free to push back if you feel otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 Just minor nit about consistency in time comparisons.
Either we stick to converting to UnixNano()
and do < , >, =
on it, Or do time.Compare
. but not both.
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell. + ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0.6%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0% |
**What this PR does / why we need it**: When a logs query results in an empty response, we cache it to avoid doing that query again and respond straight away with an empty response. However, we cache a single entry per time split interval with the query itself to keep things simple. For example, if the time split config for tenant `A` is `30m`, then queries for intervals `10m`-`20m` and `21m`-`25m` would have the same cache key. Here is roughly how cache hit is handled: * If the new query is within the cached query bounds, return empty results * If the start of new query is before the start time of the cached query, do a query from `newQuery.Start` to `cachedQuery.Start` * If the response of last query is also empty, set `cachedQuery.Start` = `newQuery.Start` * If the end of new query is after the end time of the cached query, do a query from `cachedQuery.End` to `newQuery.Start` * If the response of last query is also empty, set `cachedQuery.End` = `newQuery.End` * If we have changes in `cachedQuery.Start/End`, update it in the cache. The problem here is when we do queries to fill the gap, we sometimes do queries for the range outside of what the user requested and respond back without reducing the response to what the user requested. For example, if the cached query is from `21m`-`25m` and the user query is from `10m`-`15m`, we will query for the whole gap i.e `10m`-`21m`. If there are logs from `15m`-`21m` in the response, we will unexpectedly send it back to the user. This PR takes care of this issue by extracting the data and sending back only the user's requested logs. I have also found the tests for logs results cache were incorrect. They heavily use [mergeLokiResponse](https://github.com/grafana/loki/blob/e2842c69c5e089389b697faa792e368f42f84c19/pkg/querier/queryrange/codec.go#L1014) for building the test data, but they were not setting the `limit` in test queries which was causing `mergeLokiResponse` to send back empty results. This means were were always comparing two empty results, which would always be the same. This PR also takes care of fixing it, and relevant changes to correct the test. **Checklist** - [x] Tests updated - [x] `CHANGELOG.md` updated
**What this PR does / why we need it**: When a logs query results in an empty response, we cache it to avoid doing that query again and respond straight away with an empty response. However, we cache a single entry per time split interval with the query itself to keep things simple. For example, if the time split config for tenant `A` is `30m`, then queries for intervals `10m`-`20m` and `21m`-`25m` would have the same cache key. Here is roughly how cache hit is handled: * If the new query is within the cached query bounds, return empty results * If the start of new query is before the start time of the cached query, do a query from `newQuery.Start` to `cachedQuery.Start` * If the response of last query is also empty, set `cachedQuery.Start` = `newQuery.Start` * If the end of new query is after the end time of the cached query, do a query from `cachedQuery.End` to `newQuery.Start` * If the response of last query is also empty, set `cachedQuery.End` = `newQuery.End` * If we have changes in `cachedQuery.Start/End`, update it in the cache. The problem here is when we do queries to fill the gap, we sometimes do queries for the range outside of what the user requested and respond back without reducing the response to what the user requested. For example, if the cached query is from `21m`-`25m` and the user query is from `10m`-`15m`, we will query for the whole gap i.e `10m`-`21m`. If there are logs from `15m`-`21m` in the response, we will unexpectedly send it back to the user. This PR takes care of this issue by extracting the data and sending back only the user's requested logs. I have also found the tests for logs results cache were incorrect. They heavily use [mergeLokiResponse](https://github.com/grafana/loki/blob/e2842c69c5e089389b697faa792e368f42f84c19/pkg/querier/queryrange/codec.go#L1014) for building the test data, but they were not setting the `limit` in test queries which was causing `mergeLokiResponse` to send back empty results. This means were were always comparing two empty results, which would always be the same. This PR also takes care of fixing it, and relevant changes to correct the test. **Checklist** - [x] Tests updated - [x] `CHANGELOG.md` updated (cherry picked from commit 24deb6e)
What this PR does / why we need it:
When a logs query results in an empty response, we cache it to avoid doing that query again and respond straight away with an empty response. However, we cache a single entry per time split interval with the query itself to keep things simple. For example, if the time split config for tenant
A
is30m
, then queries for intervals10m
-20m
and21m
-25m
would have the same cache key.Here is roughly how cache hit is handled:
newQuery.Start
tocachedQuery.Start
cachedQuery.Start
=newQuery.Start
cachedQuery.End
tonewQuery.Start
cachedQuery.End
=newQuery.End
cachedQuery.Start/End
, update it in the cache.The problem here is when we do queries to fill the gap, we sometimes do queries for the range outside of what the user requested and respond back without reducing the response to what the user requested. For example, if the cached query is from
21m
-25m
and the user query is from10m
-15m
, we will query for the whole gap i.e10m
-21m
. If there are logs from15m
-21m
in the response, we will unexpectedly send it back to the user.This PR takes care of this issue by extracting the data and sending back only the user's requested logs.
I have also found the tests for logs results cache were incorrect. They heavily use mergeLokiResponse for building the test data, but they were not setting the
limit
in test queries which was causingmergeLokiResponse
to send back empty results. This means were were always comparing two empty results, which would always be the same.This PR also takes care of fixing it, and relevant changes to correct the test.
Checklist
CHANGELOG.md
updated