Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix bugs in logs results caching and its tests #7925

Merged

Conversation

sandeepsukhani
Copy link
Contributor

@sandeepsukhani sandeepsukhani commented Dec 13, 2022

What this PR does / why we need it:
When a logs query results in an empty response, we cache it to avoid doing that query again and respond straight away with an empty response. However, we cache a single entry per time split interval with the query itself to keep things simple. For example, if the time split config for tenant A is 30m, then queries for intervals 10m-20m and 21m-25m would have the same cache key.

Here is roughly how cache hit is handled:

  • If the new query is within the cached query bounds, return empty results
  • If the start of new query is before the start time of the cached query, do a query from newQuery.Start to cachedQuery.Start
  • If the response of last query is also empty, set cachedQuery.Start = newQuery.Start
  • If the end of new query is after the end time of the cached query, do a query from cachedQuery.End to newQuery.Start
  • If the response of last query is also empty, set cachedQuery.End = newQuery.End
  • If we have changes in cachedQuery.Start/End, update it in the cache.

The problem here is when we do queries to fill the gap, we sometimes do queries for the range outside of what the user requested and respond back without reducing the response to what the user requested. For example, if the cached query is from 21m-25m and the user query is from 10m-15m, we will query for the whole gap i.e 10m-21m. If there are logs from 15m-21m in the response, we will unexpectedly send it back to the user.

This PR takes care of this issue by extracting the data and sending back only the user's requested logs.

I have also found the tests for logs results cache were incorrect. They heavily use mergeLokiResponse for building the test data, but they were not setting the limit in test queries which was causing mergeLokiResponse to send back empty results. This means were were always comparing two empty results, which would always be the same.
This PR also takes care of fixing it, and relevant changes to correct the test.

Checklist

  • Tests updated
  • CHANGELOG.md updated

@sandeepsukhani sandeepsukhani requested a review from a team as a code owner December 13, 2022 12:47
Comment on lines -183 to -184
if !lokiReq.GetStartTs().After(cachedRequest.GetStartTs()) && lokiReq.GetStartTs().Equal(cachedRequest.GetStartTs()) &&
!lokiReq.GetEndTs().Before(cachedRequest.GetEndTs()) && lokiReq.GetEndTs().Equal(cachedRequest.GetEndTs()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check looked wrong to me, so I fixed it as well.

* filter out response from extended queries we make for filling the cache gap
* fix response building and comparison in tests
@sandeepsukhani sandeepsukhani force-pushed the fix-log-result-caching-bug branch from 33451cc to a7cbe87 Compare December 13, 2022 13:30
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.6%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

},
}
for _, stream := range r.Data.Result {
if stream.Entries[0].Timestamp.After(end) || stream.Entries[len(stream.Entries)-1].Timestamp.Before(start) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition one above uses ts.Nano() where as here we use time.After for comparison. Can we make both same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with different ways to compare the time for convenience and readability.
It is hard to do <= or >= with time.After and time.Before while keeping it simple to read.
If I do the same here, then it would make the code a little inefficient due to having to do the conversion for each log line. I think it would be good to keep it as is, but please feel free to push back if you feel otherwise.

Copy link
Contributor

@kavirajk kavirajk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Just minor nit about consistency in time comparisons.

Either we stick to converting to UnixNano() and do < , >, = on it, Or do time.Compare . but not both.

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.6%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@sandeepsukhani sandeepsukhani merged commit 24deb6e into grafana:main Jan 4, 2023
sandeepsukhani added a commit to sandeepsukhani/loki that referenced this pull request Jan 4, 2023
**What this PR does / why we need it**:
When a logs query results in an empty response, we cache it to avoid
doing that query again and respond straight away with an empty response.
However, we cache a single entry per time split interval with the query
itself to keep things simple. For example, if the time split config for
tenant `A` is `30m`, then queries for intervals `10m`-`20m` and
`21m`-`25m` would have the same cache key.

Here is roughly how cache hit is handled:

* If the new query is within the cached query bounds, return empty
results
* If the start of new query is before the start time of the cached
query, do a query from `newQuery.Start` to `cachedQuery.Start`
* If the response of last query is also empty, set `cachedQuery.Start` =
`newQuery.Start`
* If the end of new query is after the end time of the cached query, do
a query from `cachedQuery.End` to `newQuery.Start`
* If the response of last query is also empty, set `cachedQuery.End` =
`newQuery.End`
* If we have changes in `cachedQuery.Start/End`, update it in the cache.

The problem here is when we do queries to fill the gap, we sometimes do
queries for the range outside of what the user requested and respond
back without reducing the response to what the user requested. For
example, if the cached query is from `21m`-`25m` and the user query is
from `10m`-`15m`, we will query for the whole gap i.e `10m`-`21m`. If
there are logs from `15m`-`21m` in the response, we will unexpectedly
send it back to the user.

This PR takes care of this issue by extracting the data and sending back
only the user's requested logs.

I have also found the tests for logs results cache were incorrect. They
heavily use
[mergeLokiResponse](https://github.com/grafana/loki/blob/e2842c69c5e089389b697faa792e368f42f84c19/pkg/querier/queryrange/codec.go#L1014)
for building the test data, but they were not setting the `limit` in
test queries which was causing `mergeLokiResponse` to send back empty
results. This means were were always comparing two empty results, which
would always be the same.
This PR also takes care of fixing it, and relevant changes to correct
the test.

**Checklist**
- [x] Tests updated
- [x] `CHANGELOG.md` updated
@DylanGuedes DylanGuedes added type/bug Somehing is not working as expected backport release-2.7.x add to a PR to backport it into release 2.7.x labels Feb 23, 2023
grafanabot pushed a commit that referenced this pull request Feb 23, 2023
**What this PR does / why we need it**:
When a logs query results in an empty response, we cache it to avoid
doing that query again and respond straight away with an empty response.
However, we cache a single entry per time split interval with the query
itself to keep things simple. For example, if the time split config for
tenant `A` is `30m`, then queries for intervals `10m`-`20m` and
`21m`-`25m` would have the same cache key.

Here is roughly how cache hit is handled:

* If the new query is within the cached query bounds, return empty
results
* If the start of new query is before the start time of the cached
query, do a query from `newQuery.Start` to `cachedQuery.Start`
* If the response of last query is also empty, set `cachedQuery.Start` =
`newQuery.Start`
* If the end of new query is after the end time of the cached query, do
a query from `cachedQuery.End` to `newQuery.Start`
* If the response of last query is also empty, set `cachedQuery.End` =
`newQuery.End`
* If we have changes in `cachedQuery.Start/End`, update it in the cache.

The problem here is when we do queries to fill the gap, we sometimes do
queries for the range outside of what the user requested and respond
back without reducing the response to what the user requested. For
example, if the cached query is from `21m`-`25m` and the user query is
from `10m`-`15m`, we will query for the whole gap i.e `10m`-`21m`. If
there are logs from `15m`-`21m` in the response, we will unexpectedly
send it back to the user.

This PR takes care of this issue by extracting the data and sending back
only the user's requested logs.

I have also found the tests for logs results cache were incorrect. They
heavily use
[mergeLokiResponse](https://github.com/grafana/loki/blob/e2842c69c5e089389b697faa792e368f42f84c19/pkg/querier/queryrange/codec.go#L1014)
for building the test data, but they were not setting the `limit` in
test queries which was causing `mergeLokiResponse` to send back empty
results. This means were were always comparing two empty results, which
would always be the same.
This PR also takes care of fixing it, and relevant changes to correct
the test.

**Checklist**
- [x] Tests updated
- [x] `CHANGELOG.md` updated

(cherry picked from commit 24deb6e)
DylanGuedes added a commit that referenced this pull request Feb 23, 2023
Backport 24deb6e from #7925

---------

Co-authored-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport release-2.7.x add to a PR to backport it into release 2.7.x size/L type/bug Somehing is not working as expected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants