-
Notifications
You must be signed in to change notification settings - Fork 820
Add basic query stats collection & logging. #3539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I'd suggest moving context-setup code and stats logging into frontend handler
package, to be able to log the query itself, and not just HTTP method and request URI. I think these stats are pointless without logging the query itself.
"msg", "query stats", | ||
"user", userID, | ||
"method", r.Method, | ||
"path", r.URL.Path, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should log the query itself as well. Perhaps it would make sense to move this stats logging code to the transport.Handler
, where frontend also logs slow queries. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you ask for "stats" you get both metrics and logging of every query?
The latter seems overkill to me. Why do we want all the 2ms queries in the log file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be useful if you want to bill your customers based on number of queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That requirement seems better met by a metric than logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could already be met by setting -frontend.log-queries-longer-than
negative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could already be met by setting -frontend.log-queries-longer-than negative.
That would track the HTTP response time but no the wall time aggregated by all queriers (which was the need we had here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could already be met by setting -frontend.log-queries-longer-than negative.
Yes, we also track cortex_query_seconds_total
. Whether the Cortex operator would use metric or log really depends on their use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I will make a PR to have a separate option.
Thanks @pstibrany for your review! I should have addressed your comments. |
CHANGELOG.md
Outdated
@@ -9,6 +9,8 @@ | |||
- limit for outgoing gRPC messages has changed from 2147483647 to 16777216 bytes | |||
- limit for incoming gRPC messages has changed from 4194304 to 104857600 bytes | |||
* [FEATURE] Distributor/Ingester: Provide ability to not overflow writes in the presence of a leaving or unhealthy ingester. This allows for more efficient ingester rolling restarts. #3305 | |||
* [FEATURE] Query-frontend: introduced query statistics logged in the query-frontend when enabled via `-frontend.query-stats-enabled=true`. When enabled the following metrics are also tracked: #3539 | |||
- `cortex_query_seconds_total` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain somewhere (e.g. in the docs) how this metric differers from cortex_request_duration_seconds{route="api_prom_api_v1_query_range"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure. Done.
Note: prometheus/prometheus#6890 is the right way to track samples and series and is currently blocked because some benchmarks are significantly worse. When we try to implement the series and samples tracking ourselves, we should keep an eye on the benchmarks. |
Agree 💯 on the benchmarks. However, I believe there's no reasonable way to implement it ourselves. I think should be tracked by PromQL engine to get the count done correctly. |
Current collects & logs wall clock time and number of series in the queries. TODO: - collection number of samples - propagate & aggregate in the query frontend - add some metrics. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Only partially done, still need to merge & record the results in the query frontend. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
} | ||
|
||
func (cfg *HandlerConfig) RegisterFlags(f *flag.FlagSet) { | ||
f.DurationVar(&cfg.LogQueriesLongerThan, "frontend.log-queries-longer-than", 0, "Log queries that are slower than the specified duration. Set to 0 to disable. Set to < 0 to enable on all queries.") | ||
f.Int64Var(&cfg.MaxBodySize, "frontend.max-body-size", 10*1024*1024, "Max body size for downstream prometheus.") | ||
f.BoolVar(&cfg.QueryStatsEnabled, "frontend.query-stats-enabled", false, "True to enable query statistics tracking. When enabled, a message with some statistics is logged for every query. This configuration option must be set both on query-frontend and querier.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this get the "frontend" name when it goes on both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's now only set on frontend: #3595
What this PR does:
Current collects & logs wall clock time and number of series in the queries.
TODO:
Signed-off-by: Tom Wilkie tom.wilkie@gmail.com
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]