-
Notifications
You must be signed in to change notification settings - Fork 438
Prometheus quantile summaries #4588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4588 +/- ##
==========================================
+ Coverage 86.06% 86.08% +0.01%
==========================================
Files 563 565 +2
Lines 33732 33890 +158
==========================================
+ Hits 29032 29174 +142
- Misses 4700 4716 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
d3cc3fe to
0082d31
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
cb8f42b to
36bff41
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
9600ec8 to
6f69268
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
chrzaszcz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments. We decided to replace the metrics with the 60-second ones.
src/instrument/mongoose_prometheus_sliding_window_collector.erl
Outdated
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
0272fdf to
f8bb095
Compare
This comment was marked as outdated.
This comment was marked as outdated.
f8bb095 to
8c7223b
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
aee9112 to
185e948
Compare
This comment was marked as outdated.
This comment was marked as outdated.
use only sliding_window; fix `do_remove` function
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
chrzaszcz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's going in the right direction, but I think it still requires some improvements.
| handle_cast(_Msg, State) -> | ||
| {noreply, State}. | ||
|
|
||
| handle_info({timeout, _TimerRef, rotate}, State) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... what is the point of this? We just receive the message, do nothing, and schedule again?
| Key = {Name, LabelValues}, | ||
| CurrentTime = erlang:monotonic_time(millisecond), | ||
| {Windows, CurrentIndex} = | ||
| ensure_windows_rotated(Key, CurrentTime, State), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing the point, but IMO we should do time-based rotation.
What I mean is that if the metric is neither updated nor obtained, e.g. for 5 window steps, here we would just update it by 1, right? But we should already have 4 empty windows at this point.
IMO something is wrong here, especially when combined with the unused timer.
I'd like to see unit tests checking that the window is actually sliding correctly with time.
| CurrentTime = erlang:monotonic_time(millisecond), | ||
| {UpdatedWindows, NewIndex} = ensure_windows_rotated(Key, CurrentTime, State), | ||
|
|
||
| %% Update metric state with rotated windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we split these steps and possibly reuse them between this function and do_observe? It's getting complex, and would be difficult to trace. More functional, less imperative please 🙂
Btw, I'm not convinced we want to rotate here at all (see other comments).
|
elasticsearch_and_cassandra_28 / elasticsearch_and_cassandra_mnesia / 7d52e9b small_tests_27 / small_tests / 7d52e9b small_tests_28 / small_tests / 7d52e9b small_tests_28_arm64 / small_tests / 7d52e9b ldap_mnesia_27 / ldap_mnesia / 7d52e9b dynamic_domains_mysql_redis_28 / mysql_redis / 7d52e9b dynamic_domains_pgsql_mnesia_27 / pgsql_mnesia / 7d52e9b internal_mnesia_28 / internal_mnesia / 7d52e9b dynamic_domains_pgsql_mnesia_28 / pgsql_mnesia / 7d52e9b graphql_server_SUITE:admin_cli:clustering_tests:remove_alive_from_cluster{error,{{badrpc,nodedown},
[{distributed_helper,rpc,
[#{timeout => 60000,
node => mongooseim3@localhost},
mongoose_cluster,join,
[mongooseim@localhost]],
[{file,"/home/circleci/project/big_tests/../test/common/distributed_helper.erl"},
{line,143}]},
{graphql_server_SUITE,remove_alive_from_cluster,1,
[{file,"/home/circleci/project/big_tests/tests/graphql_server_SUITE.erl"},
{line,214}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1796}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1305}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1237}]}]}}graphql_server_SUITE:admin_cli:clustering_tests:remove_node_test{error,{#{what => invalid_response_code,expected_type => ok,
response_code => {exit_status,3}},
[{graphql_helper,assert_response_code,2,
[{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
{line,258}]},
{graphql_helper,get_ok_value,2,
[{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
{line,241}]},
{graphql_server_SUITE,remove_node_test,1,
[{file,"/home/circleci/project/big_tests/tests/graphql_server_SUITE.erl"},
{line,225}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1796}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1305}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1237}]}]}}graphql_server_SUITE:admin_cli:clustering_tests:stop_node_test{error,{#{what => invalid_response_code,expected_type => ok,
response_code => {exit_status,3}},
[{graphql_helper,assert_response_code,2,
[{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
{line,258}]},
{graphql_helper,get_ok_value,2,
[{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
{line,241}]},
{graphql_server_SUITE,stop_node_test,1,
[{file,"/home/circleci/project/big_tests/tests/graphql_server_SUITE.erl"},
{line,230}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1796}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1305}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1237}]}]}}last_SUITE:init_per_suite{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}metrics_api_SUITE:init_per_suite{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}persistent_cluster_id_SUITE:init_per_suite{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}service_domain_db_SUITE:init_per_suite{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}service_mongoose_system_metrics_SUITE:init_per_suite{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}shutdown_SUITE:init_per_suite{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}pgsql_cets_28 / pgsql_cets / 7d52e9b pgsql_mnesia_28 / pgsql_mnesia / 7d52e9b mysql_redis_28 / mysql_redis / 7d52e9b pgsql_mnesia_27 / pgsql_mnesia / 7d52e9b cockroachdb_cets_28 / cockroachdb_cets / 7d52e9b pubsub_SUITE:dag+last_item_cache:send_last_published_item_no_items_test{error,
{timeout_when_waiting_for_stanza,
[{escalus_client,wait_for_stanza,
[{client,
<<"alice_send_last_published_item_no_items_test_3680@localhost/res1">>,
escalus_tcp,<0.115796.0>,
[{event_manager,<0.115795.0>},
{server,<<"localhost">>},
{username,
<<"alicE_send_last_published_item_no_items_test_3680">>},
{resource,<<"res1">>}],
[{event_client,
[{event_manager,<0.115795.0>},
{server,<<"localhost">>},
{username,
<<"alicE_send_last_published_item_no_items_test_3680">>},
{resource,<<"res1">>}]},
{resource,<<"res1">>},
{username,
<<"alice_send_last_published_item_no_items_test_3680">>},
{server,<<"localhost">>},
{host,<<"localhost">>},
{port,5222},
{auth,fun escalus_auth:auth_plain/2},
{wspath,undefined},
{username,
<<"alicE_send_last_published_item_no_items_test_3680">>},
{server,<<"localhost">>},
{password,<<"matygrysa">>},
{stream_id,<<"41ccf15eb1a9fd71">>}]},
5000],
[{file,
"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
{line,136}]},
{pubsub_tools,receive_response,3,
[{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
{line,444}]},
{pubsub_tools,receive_and_c...pubsub_SUITE:tree+last_item_cache:send_last_published_item_no_items_test{error,
{timeout_when_waiting_for_stanza,
[{escalus_client,wait_for_stanza,
[{client,
<<"alice_send_last_published_item_no_items_test_3734@localhost/res1">>,
escalus_tcp,<0.117081.0>,
[{event_manager,<0.117070.0>},
{server,<<"localhost">>},
{username,
<<"alicE_send_last_published_item_no_items_test_3734">>},
{resource,<<"res1">>}],
[{event_client,
[{event_manager,<0.117070.0>},
{server,<<"localhost">>},
{username,
<<"alicE_send_last_published_item_no_items_test_3734">>},
{resource,<<"res1">>}]},
{resource,<<"res1">>},
{username,
<<"alice_send_last_published_item_no_items_test_3734">>},
{server,<<"localhost">>},
{host,<<"localhost">>},
{port,5222},
{auth,fun escalus_auth:auth_plain/2},
{wspath,undefined},
{username,
<<"alicE_send_last_published_item_no_items_test_3734">>},
{server,<<"localhost">>},
{password,<<"matygrysa">>},
{stream_id,<<"0c4383c9513ab8ce">>}]},
5000],
[{file,
"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
{line,136}]},
{pubsub_tools,receive_response,3,
[{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
{line,444}]},
{pubsub_tools,receive_and_c...dynamic_domains_pgsql_mnesia_28 / pgsql_mnesia / 7d52e9b |
This PR replaces
prometheus_histogramwithprometheus_quantile_summarythat uses DDSketch algorithm for maintaining pre-defined quantiles with relative-error guarantees.The Prometheus Erlang client does not implement a sliding time window when calculating quantiles, so this functionality is handled in
mongoose_prometheus_sliding_window.erlandmongoose_prometheus_sliding_window_collector.erl.The predefined quantiles are set to match those used by Exometer, i.e., 0.5, 0.75, 0.90, 0.95, 0.99, and 0.999. The sliding window size is set to 1 minute and is updated every 3 seconds.