Add basic query stats collection & logging. #3539

tomwilkie · 2020-11-25T15:43:54Z

What this PR does:

Currently collects & logs wall clock time and the number of series in the queries.

TODO:

collect number of samples
propagate & aggregate in the query frontend
add some metrics.
for labels and series endpoints

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pkg/api/handlers.go

pkg/querier/blocks_store_queryable.go

pstibrany

LGTM

I'd suggest moving the context-setup code and stats logging into the frontend handler package, to be able to log the query itself, and not just the HTTP method and request URI. I think these stats are pointless without logging the query itself.

pkg/querier/stats/report_middleware.go

pstibrany · 2020-12-04T08:28:23Z

pkg/querier/stats/report_middleware.go

+			"msg", "query stats",
+			"user", userID,
+			"method", r.Method,
+			"path", r.URL.Path,


I think we should log the query itself as well. Perhaps it would make sense to move this stats logging code to the transport.Handler, where frontend also logs slow queries. WDYT?

Makes sense. Done.

So if you ask for "stats" you get both metrics and logging of every query?
The latter seems overkill to me. Why do we want all the 2ms queries in the log file?

It can be useful if you want to bill your customers based on number of queries.

That requirement seems better met by a metric than logs.

Also could already be met by setting -frontend.log-queries-longer-than negative.

> Also could already be met by setting -frontend.log-queries-longer-than negative.

That would track the HTTP response time but not the wall time aggregated across all queriers (which was the need we had here).

> Also could already be met by setting -frontend.log-queries-longer-than negative.

Yes, we also track cortex_query_seconds_total. Whether the Cortex operator would use metric or log really depends on their use case.

OK I will make a PR to have a separate option.

pkg/frontend/config.go

pkg/frontend/v2/frontend.go

pkg/querier/stats/stats.go

pracucci · 2020-12-04T18:20:22Z

Thanks @pstibrany for your review! I should have addressed your comments.

bboreham · 2020-12-07T11:04:11Z

CHANGELOG.md

@@ -9,6 +9,8 @@
  - limit for outgoing gRPC messages has changed from 2147483647 to 16777216 bytes
  - limit for incoming gRPC messages has changed from 4194304 to 104857600 bytes
 * [FEATURE] Distributor/Ingester: Provide ability to not overflow writes in the presence of a leaving or unhealthy ingester. This allows for more efficient ingester rolling restarts. #3305
+* [FEATURE] Query-frontend: introduced query statistics logged in the query-frontend when enabled via `-frontend.query-stats-enabled=true`. When enabled the following metrics are also tracked: #3539
+  - `cortex_query_seconds_total`


Can you explain somewhere (e.g. in the docs) how this metric differs from cortex_request_duration_seconds{route="api_prom_api_v1_query_range"}?

Yes, sure. Done.

gouthamve · 2020-12-07T16:58:52Z

Note: prometheus/prometheus#6890 is the right way to track samples and series and is currently blocked because some benchmarks are significantly worse.

When we try to implement the series and samples tracking ourselves, we should keep an eye on the benchmarks.

pracucci · 2020-12-09T08:14:57Z

> Note: prometheus/prometheus#6890 is the right way to track samples and series and is currently blocked because some benchmarks are significantly worse.
>
> When we try to implement the series and samples tracking ourselves, we should keep an eye on the benchmarks.

Agree 💯 on the benchmarks. However, I believe there's no reasonable way to implement it ourselves. I think it should be tracked by the PromQL engine to get the count done correctly.

pstibrany

LGTM, thanks!

bboreham · 2021-01-05T14:14:31Z

pkg/frontend/transport/handler.go

 }

 func (cfg *HandlerConfig) RegisterFlags(f *flag.FlagSet) {
 	f.DurationVar(&cfg.LogQueriesLongerThan, "frontend.log-queries-longer-than", 0, "Log queries that are slower than the specified duration. Set to 0 to disable. Set to < 0 to enable on all queries.")
 	f.Int64Var(&cfg.MaxBodySize, "frontend.max-body-size", 10*1024*1024, "Max body size for downstream prometheus.")
+	f.BoolVar(&cfg.QueryStatsEnabled, "frontend.query-stats-enabled", false, "True to enable query statistics tracking. When enabled, a message with some statistics is logged for every query. This configuration option must be set both on query-frontend and querier.")


Why did this get the "frontend" name when it goes on both?

It's now only set on frontend: #3595
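
For reference, a minimal, self-contained sketch of how a boolean flag like the one in the diff above is registered and parsed with the standard `flag` package. The `handlerConfig` type and `parseConfig` helper are stand-ins invented for this example; only the flag name and default are taken from the diff.

```go
package main

import (
	"flag"
	"fmt"
)

// handlerConfig is a trimmed-down, hypothetical stand-in for HandlerConfig.
type handlerConfig struct {
	QueryStatsEnabled bool
}

// RegisterFlags registers the flag under the name used in the diff.
func (cfg *handlerConfig) RegisterFlags(f *flag.FlagSet) {
	f.BoolVar(&cfg.QueryStatsEnabled, "frontend.query-stats-enabled", false, "True to enable query statistics tracking.")
}

// parseConfig parses args into a fresh config (helper for this example).
func parseConfig(args []string) (handlerConfig, error) {
	var cfg handlerConfig
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	cfg.RegisterFlags(fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]string{"-frontend.query-stats-enabled=true"})
	if err != nil {
		panic(err)
	}
	fmt.Println("query stats enabled:", cfg.QueryStatsEnabled)
}
```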

pull-request-size bot added the size/M label Nov 25, 2020

pracucci reviewed Nov 26, 2020

pkg/api/handlers.go (outdated)

pkg/querier/blocks_store_queryable.go (outdated)

pull-request-size bot added size/L and removed size/M labels Nov 26, 2020

pracucci force-pushed the querystats branch from 06fbe27 to 36feec8 on December 3, 2020 09:52

pull-request-size bot added size/XL and removed size/L labels Dec 3, 2020

pracucci marked this pull request as ready for review December 3, 2020 16:10

pull-request-size bot added size/L and removed size/XL labels Dec 3, 2020

pstibrany approved these changes Dec 4, 2020

bboreham reviewed Dec 7, 2020

tomwilkie and others added 15 commits December 9, 2020 09:17

Add basic query stats collection & logging.

c63a66c

Current collects & logs wall clock time and number of series in the queries. TODO: - collection number of samples - propagate & aggregate in the query frontend - add some metrics. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

Use time.Since()

413014e

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

Make Stats a proto so we can propagate it over gRPC.

1932681

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

Fixed series tracker

85f6739

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Track number of samples too

cbcfb7b

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Propagate the stats via gRPC

77b4702

Only partially done, still need to merge & record the results in the query frontend. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

Plugged in stats tracker in the query-frontend

7d4d49a

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Removed series tracking

66948ed

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Fixed tests

956ce3c

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Fixed wall time tracking

caeda34

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Added method and path to log

b98bf6c

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Introduced a CLI flag to enable query stats

567ecdf

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Fixed TODOs and linter

3936ce1

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Updated doc and CHANGELOG

6eed25a

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Rolledback samples tracking

cda7c1f

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci added 5 commits December 9, 2020 09:17

Addressed easy comments

18528a3

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Moved query stats reporter to frontend transport.Handler

ef5905e

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Renamed query stats log fields

f0b70f7

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Improved log message

76f2ae8

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Updated CHANGELOG

b9b57e7

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci force-pushed the querystats branch from cad539a to b9b57e7 on December 9, 2020 08:23

pstibrany approved these changes Dec 9, 2020

pracucci merged commit 6365787 into master Dec 9, 2020

pracucci deleted the querystats branch December 9, 2020 09:18

pracucci mentioned this pull request Dec 11, 2020

Propagate whether query stats is enabled from query-frontend down the request path #3595

Merged

3 tasks

bboreham reviewed Jan 5, 2021
