mdbrnowski (Member) commented Dec 3, 2025

This PR replaces `prometheus_histogram` with `prometheus_quantile_summary`, which uses the DDSketch algorithm to maintain predefined quantiles with relative-error guarantees.
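
For readers unfamiliar with DDSketch, the relative-error guarantee comes from logarithmically sized buckets. The following is a minimal sketch of that idea only: the module name, the function names, and the accuracy parameter `Alpha` are hypothetical, and this is not the code used by `prometheus_quantile_summary`.

```erlang
-module(ddsketch_sketch).
-export([gamma/1, bucket_index/2, bucket_value/2]).

%% Illustration only: names and the Alpha parameter are assumptions,
%% not taken from the PR or from the Prometheus Erlang client.

%% Growth factor between consecutive bucket boundaries for a
%% target relative accuracy Alpha (0 < Alpha < 1).
gamma(Alpha) when Alpha > 0, Alpha < 1 ->
    (1 + Alpha) / (1 - Alpha).

%% Index of the bucket that covers a positive value:
%% the smallest I such that Value =< gamma^I.
bucket_index(Value, Alpha) when Value > 0 ->
    ceil(math:log(Value) / math:log(gamma(Alpha))).

%% Representative value reported for a bucket. It lies between the
%% bucket boundaries, so it is within a factor of Alpha of any value
%% stored in that bucket.
bucket_value(Index, Alpha) ->
    G = gamma(Alpha),
    2 * math:pow(G, Index) / (G + 1).
```

A quantile query then walks the bucket counts until the requested rank is reached and returns that bucket's representative value, so the answer is off by at most a factor of `Alpha` from the true quantile.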

The Prometheus Erlang client does not implement a sliding time window when calculating quantiles, so this functionality is handled in mongoose_prometheus_sliding_window.erl and mongoose_prometheus_sliding_window_collector.erl.

The predefined quantiles are set to match those used by Exometer, i.e., 0.5, 0.75, 0.90, 0.95, 0.99, and 0.999. The sliding window size is set to 1 minute and is updated every 3 seconds.
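
As a rough illustration of these settings, the sketch below derives the number of sub-windows from the two durations. It assumes the 1-minute window is kept as equally sized sub-windows, one per 3-second update, which the description above does not state explicitly; the module and function names are hypothetical.

```erlang
-module(window_params_sketch).
-export([quantiles/0, slot_count/2]).

%% Illustration only: module and function names are assumptions.

%% The quantiles listed in the PR description.
quantiles() ->
    [0.5, 0.75, 0.90, 0.95, 0.99, 0.999].

%% Number of sub-windows needed to cover WindowMs when the window
%% advances every StepMs. Requires the step to divide the window evenly.
slot_count(WindowMs, StepMs)
  when is_integer(WindowMs), is_integer(StepMs),
       StepMs > 0, WindowMs rem StepMs =:= 0 ->
    WindowMs div StepMs.
```

With the values above, `window_params_sketch:slot_count(60000, 3000)` gives 20 sub-windows.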

@mongoose-im

This comment was marked as outdated.

codecov bot commented Dec 3, 2025

Codecov Report

❌ Patch coverage is 90.24390% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.08%. Comparing base (ee8e4f6) to head (7d52e9b).
⚠️ Report is 125 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../instrument/mongoose_prometheus_sliding_window.erl | 90.22% | 13 Missing ⚠️ |
| ...t/mongoose_prometheus_sliding_window_collector.erl | 87.50% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4588      +/-   ##
==========================================
+ Coverage   86.06%   86.08%   +0.01%     
==========================================
  Files         563      565       +2     
  Lines       33732    33890     +158     
==========================================
+ Hits        29032    29174     +142     
- Misses       4700     4716      +16     


mdbrnowski force-pushed the prometheus-quantile-summaries branch from d3cc3fe to 0082d31 on December 17, 2025 10:55
mdbrnowski force-pushed the prometheus-quantile-summaries branch from cb8f42b to 36bff41 on December 19, 2025 10:24
mdbrnowski force-pushed the prometheus-quantile-summaries branch from 9600ec8 to 6f69268 on December 19, 2025 12:22

chrzaszcz (Member) left a comment

I added some comments. We decided to replace the metrics with the 60-second ones.

mdbrnowski force-pushed the prometheus-quantile-summaries branch from 0272fdf to f8bb095 on January 21, 2026 11:49
mdbrnowski force-pushed the prometheus-quantile-summaries branch from f8bb095 to 8c7223b on January 21, 2026 13:14
mdbrnowski force-pushed the prometheus-quantile-summaries branch from aee9112 to 185e948 on January 22, 2026 10:16

use only sliding_window; fix `do_remove` function

mdbrnowski marked this pull request as ready for review on January 28, 2026 09:04
mdbrnowski requested a review from chrzaszcz on January 28, 2026 09:04

chrzaszcz (Member) left a comment

It's going in the right direction, but I think it still requires some improvements.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({timeout, _TimerRef, rotate}, State) ->

So... what is the point of this? We just receive the message, do nothing, and schedule again?

Key = {Name, LabelValues},
CurrentTime = erlang:monotonic_time(millisecond),
{Windows, CurrentIndex} =
    ensure_windows_rotated(Key, CurrentTime, State),

Maybe I'm missing the point, but IMO we should do time-based rotation.
What I mean is that if the metric is neither updated nor obtained, e.g. for 5 window steps, here we would just update it by 1, right? But we should already have 4 empty windows at this point.
IMO something is wrong here, especially when combined with the unused timer.

I'd like to see unit tests checking that the window is actually sliding correctly with time.
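
A minimal sketch of what time-based rotation could look like, written so that the current time is passed in and the behaviour can be unit tested without sleeping. Everything here is hypothetical and not taken from the PR: the module name, the map-based state, and the plain integer counters that stand in for per-window sketches.

```erlang
-module(rotation_sketch).
-export([new/3, rotate/2, observe/3, total/1]).

%% Illustration only. Each slot holds a counter instead of a real sketch.

%% SlotCount sub-windows, each StepMs wide, starting at NowMs.
new(SlotCount, StepMs, NowMs) when SlotCount > 0, StepMs > 0 ->
    #{slots => erlang:make_tuple(SlotCount, 0),
      index => 1,
      step_ms => StepMs,
      last_ms => NowMs}.

%% Advance by as many whole steps as have elapsed since the last
%% rotation, clearing every slot that is passed over. After 5 idle
%% steps this clears 5 slots, not 1.
rotate(#{slots := Slots, index := Index,
         step_ms := StepMs, last_ms := Last} = W, NowMs)
  when NowMs >= Last ->
    Steps = (NowMs - Last) div StepMs,
    %% Clearing more than SlotCount slots would only repeat work.
    Capped = min(Steps, tuple_size(Slots)),
    {NewSlots, NewIndex} = clear_next(Capped, Slots, Index),
    W#{slots := NewSlots, index := NewIndex,
       last_ms := Last + Steps * StepMs}.

%% Rotate first, then add Count to the current slot.
observe(W0, NowMs, Count) ->
    W = rotate(W0, NowMs),
    #{slots := Slots, index := I} = W,
    W#{slots := setelement(I, Slots, element(I, Slots) + Count)}.

%% Sum over all slots, i.e. the value seen over the whole window.
total(#{slots := Slots}) ->
    lists:sum(tuple_to_list(Slots)).

clear_next(0, Slots, Index) ->
    {Slots, Index};
clear_next(N, Slots, Index) ->
    Next = (Index rem tuple_size(Slots)) + 1,
    clear_next(N - 1, setelement(Next, Slots, 0), Next).
```

Because `rotate/2` is a pure function of the state and a timestamp, a test can feed it increasing timestamps and assert that old observations drop out exactly when their slot is reused. The same function could be called from the observe path, the read path, and a timer handler.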

CurrentTime = erlang:monotonic_time(millisecond),
{UpdatedWindows, NewIndex} = ensure_windows_rotated(Key, CurrentTime, State),

%% Update metric state with rotated windows

Could we split these steps and possibly reuse them between this function and do_observe? It's getting complex, and would be difficult to trace. More functional, less imperative please 🙂

Btw, I'm not convinced we want to rotate here at all (see other comments).

mongoose-im (Collaborator) commented Jan 30, 2026

elasticsearch_and_cassandra_28 / elasticsearch_and_cassandra_mnesia / 7d52e9b
Reports root/ big
OK: 683 / Failed: 0 / User-skipped: 72 / Auto-skipped: 0


small_tests_27 / small_tests / 7d52e9b
Reports root / small


small_tests_28 / small_tests / 7d52e9b
Reports root / small


small_tests_28_arm64 / small_tests / 7d52e9b
Reports root / small


ldap_mnesia_27 / ldap_mnesia / 7d52e9b
Reports root/ big
OK: 2358 / Failed: 0 / User-skipped: 1376 / Auto-skipped: 0


dynamic_domains_mysql_redis_28 / mysql_redis / 7d52e9b
Reports root/ big
OK: 5196 / Failed: 0 / User-skipped: 157 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_27 / pgsql_mnesia / 7d52e9b
Reports root/ big
OK: 5231 / Failed: 0 / User-skipped: 122 / Auto-skipped: 0


internal_mnesia_28 / internal_mnesia / 7d52e9b
Reports root/ big
OK: 2506 / Failed: 0 / User-skipped: 1228 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_28 / pgsql_mnesia / 7d52e9b
Reports root/ big
OK: 5040 / Failed: 3 / User-skipped: 120 / Auto-skipped: 190

graphql_server_SUITE:admin_cli:clustering_tests:remove_alive_from_cluster
{error,{{badrpc,nodedown},
    [{distributed_helper,rpc,
               [#{timeout => 60000,
                node => mongooseim3@localhost},
                mongoose_cluster,join,
                [mongooseim@localhost]],
               [{file,"/home/circleci/project/big_tests/../test/common/distributed_helper.erl"},
                {line,143}]},
     {graphql_server_SUITE,remove_alive_from_cluster,1,
                 [{file,"/home/circleci/project/big_tests/tests/graphql_server_SUITE.erl"},
                {line,214}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1796}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1305}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1237}]}]}}

Report log

graphql_server_SUITE:admin_cli:clustering_tests:remove_node_test
{error,{#{what => invalid_response_code,expected_type => ok,
      response_code => {exit_status,3}},
    [{graphql_helper,assert_response_code,2,
             [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
              {line,258}]},
     {graphql_helper,get_ok_value,2,
             [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
              {line,241}]},
     {graphql_server_SUITE,remove_node_test,1,
                 [{file,"/home/circleci/project/big_tests/tests/graphql_server_SUITE.erl"},
                {line,225}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1796}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1305}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1237}]}]}}

Report log

graphql_server_SUITE:admin_cli:clustering_tests:stop_node_test
{error,{#{what => invalid_response_code,expected_type => ok,
      response_code => {exit_status,3}},
    [{graphql_helper,assert_response_code,2,
             [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
              {line,258}]},
     {graphql_helper,get_ok_value,2,
             [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
              {line,241}]},
     {graphql_server_SUITE,stop_node_test,1,
                 [{file,"/home/circleci/project/big_tests/tests/graphql_server_SUITE.erl"},
                {line,230}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1796}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1305}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1237}]}]}}

Report log

last_SUITE:init_per_suite
{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}

Report log

metrics_api_SUITE:init_per_suite
{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}

Report log

persistent_cluster_id_SUITE:init_per_suite
{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}

Report log

service_domain_db_SUITE:init_per_suite
{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}

Report log

service_mongoose_system_metrics_SUITE:init_per_suite
{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}

Report log

shutdown_SUITE:init_per_suite
{fail,[{validate_node_failed,{badrpc,nodedown},mongooseim3@localhost}]}

Report log


pgsql_cets_28 / pgsql_cets / 7d52e9b
Reports root/ big
OK: 5321 / Failed: 0 / User-skipped: 202 / Auto-skipped: 0


pgsql_mnesia_28 / pgsql_mnesia / 7d52e9b
Reports root/ big
OK: 5624 / Failed: 0 / User-skipped: 142 / Auto-skipped: 0


mysql_redis_28 / mysql_redis / 7d52e9b
Reports root/ big
OK: 5617 / Failed: 0 / User-skipped: 149 / Auto-skipped: 0


pgsql_mnesia_27 / pgsql_mnesia / 7d52e9b
Reports root/ big
OK: 5624 / Failed: 0 / User-skipped: 142 / Auto-skipped: 0


cockroachdb_cets_28 / cockroachdb_cets / 7d52e9b
Reports root/ big
OK: 5325 / Failed: 2 / User-skipped: 202 / Auto-skipped: 0

pubsub_SUITE:dag+last_item_cache:send_last_published_item_no_items_test
{error,
  {timeout_when_waiting_for_stanza,
    [{escalus_client,wait_for_stanza,
       [{client,
          <<"alice_send_last_published_item_no_items_test_3680@localhost/res1">>,
          escalus_tcp,<0.115796.0>,
          [{event_manager,<0.115795.0>},
           {server,<<"localhost">>},
           {username,
             <<"alicE_send_last_published_item_no_items_test_3680">>},
           {resource,<<"res1">>}],
          [{event_client,
             [{event_manager,<0.115795.0>},
            {server,<<"localhost">>},
            {username,
              <<"alicE_send_last_published_item_no_items_test_3680">>},
            {resource,<<"res1">>}]},
           {resource,<<"res1">>},
           {username,
             <<"alice_send_last_published_item_no_items_test_3680">>},
           {server,<<"localhost">>},
           {host,<<"localhost">>},
           {port,5222},
           {auth,fun escalus_auth:auth_plain/2},
           {wspath,undefined},
           {username,
             <<"alicE_send_last_published_item_no_items_test_3680">>},
           {server,<<"localhost">>},
           {password,<<"matygrysa">>},
           {stream_id,<<"41ccf15eb1a9fd71">>}]},
        5000],
       [{file,
          "/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
        {line,136}]},
     {pubsub_tools,receive_response,3,
       [{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
        {line,444}]},
     {pubsub_tools,receive_and_c...

Report log

pubsub_SUITE:tree+last_item_cache:send_last_published_item_no_items_test
{error,
  {timeout_when_waiting_for_stanza,
    [{escalus_client,wait_for_stanza,
       [{client,
          <<"alice_send_last_published_item_no_items_test_3734@localhost/res1">>,
          escalus_tcp,<0.117081.0>,
          [{event_manager,<0.117070.0>},
           {server,<<"localhost">>},
           {username,
             <<"alicE_send_last_published_item_no_items_test_3734">>},
           {resource,<<"res1">>}],
          [{event_client,
             [{event_manager,<0.117070.0>},
            {server,<<"localhost">>},
            {username,
              <<"alicE_send_last_published_item_no_items_test_3734">>},
            {resource,<<"res1">>}]},
           {resource,<<"res1">>},
           {username,
             <<"alice_send_last_published_item_no_items_test_3734">>},
           {server,<<"localhost">>},
           {host,<<"localhost">>},
           {port,5222},
           {auth,fun escalus_auth:auth_plain/2},
           {wspath,undefined},
           {username,
             <<"alicE_send_last_published_item_no_items_test_3734">>},
           {server,<<"localhost">>},
           {password,<<"matygrysa">>},
           {stream_id,<<"0c4383c9513ab8ce">>}]},
        5000],
       [{file,
          "/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
        {line,136}]},
     {pubsub_tools,receive_response,3,
       [{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
        {line,444}]},
     {pubsub_tools,receive_and_c...

Report log


dynamic_domains_pgsql_mnesia_28 / pgsql_mnesia / 7d52e9b
Reports root/ big
OK: 5231 / Failed: 0 / User-skipped: 122 / Auto-skipped: 0
