*: Use exponential buckets for histogram metrics #1545

Merged
5 commits merged on Nov 14, 2019

@kakkoyun kakkoyun commented Sep 19, 2019

This PR changes the existing bucket configurations to fix issues observed in the latency graphs.

For example, the graphs below show large differences between the mean and P50 latencies.

  • thanos_compact_garbage_collection_duration_seconds_bucket

Screenshot 2019-09-25 16 55 49

  • thanos_compact_sync_meta_duration_seconds_bucket

Screenshot 2019-09-25 16 57 43

This increases the number of buckets for most of the histograms. For certain metrics, it significantly affects cardinality. However, it's needed to properly instrument the components.
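
To make the cardinality impact concrete, here is a rough sketch of the series count per histogram. It relies on the fact that a Prometheus histogram exposes one `_bucket` series per finite bound, one for `le="+Inf"`, plus `_sum` and `_count`. The `labelSets` value is a made-up number for illustration only, not something measured in this PR.

```go
package main

import "fmt"

// seriesPerLabelSet returns how many time series one histogram exposes for a
// single label combination: one _bucket series per finite upper bound, one
// for le="+Inf", plus the _sum and _count series.
func seriesPerLabelSet(finiteBuckets int) int {
	return finiteBuckets + 1 + 2
}

func main() {
	// labelSets is a hypothetical number of label combinations, chosen only
	// for illustration; real values depend on the deployment.
	const labelSets = 50

	before := seriesPerLabelSet(10) * labelSets
	after := seriesPerLabelSet(15) * labelSets
	fmt.Printf("before=%d after=%d extra=%d\n", before, after, after-before)
}
```

The growth is linear in the number of label combinations, which is why the histograms that expose several labels are the ones to watch.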

Changes

Uses exponential buckets to provide more even distribution. (number of buckets, before and after)

  • grpc_server_handling_seconds_bucket : 10 -> 15 (+exposes multiple labels)
  • http_request_duration_seconds_bucket : 11 -> 17 (+exposes 3 labels, code, method, handler)
  • thanos_compact_sync_meta_duration_seconds_bucket : 14 -> 15
  • thanos_compact_garbage_collection_duration_seconds_bucket : 14 -> 15
  • thanos_objstore_bucket_operation_duration_seconds_bucket : 15 -> 17
  • thanos_bucket_store_series_get_all_duration_seconds_bucket : 14 -> 15
  • thanos_bucket_store_series_gate_duration_seconds_bucket : 14 -> 15
  • thanos_bucket_store_series_merge_duration_seconds_bucket : 10 -> 15
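
For reference, a minimal sketch of how an exponential bucket layout is generated. `exponentialBuckets` below is a local stand-in that mirrors the basic behaviour of `prometheus.ExponentialBuckets(start, factor, count)` without importing the client library; it skips the argument validation the real function performs.

```go
package main

import "fmt"

// exponentialBuckets is a local stand-in for prometheus.ExponentialBuckets:
// it returns count upper bounds, the first equal to start and each following
// bound equal to the previous one multiplied by factor.
// Assumption: callers pass start > 0, factor > 1 and count >= 1.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Same arguments as the grpc_server_handling_seconds change in this PR.
	for _, b := range exponentialBuckets(0.001, 2, 15) {
		fmt.Printf("le=%.3f\n", b)
	}
}
```

With a factor of 2, every bucket is twice as wide as the previous one, so the relative resolution stays the same across the whole range from 1ms to about 16s.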

Verification

  1. make test
  2. Run MINIO_ENABLED=1 ./scripts/quickstart.sh, then curl the /metrics endpoint.

@GiedriusS
Member

@kakkoyun how is this PR going?

@kakkoyun
Member Author

kakkoyun commented Nov 9, 2019

@GiedriusS I had to park this one for a while, but I haven't abandoned it and I'll have another look at it soon. I have also discovered similar issues with the Store GW histograms, so I may include those improvements in this PR as well.

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
@kakkoyun kakkoyun changed the title from "compactor: Use exponential buckets for histogram metrics" to "*: Use exponential buckets for histogram metrics" Nov 13, 2019
-	grpc_prometheus.WithHistogramBuckets([]float64{
-		0.001, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4,
-	}),
+	grpc_prometheus.WithHistogramBuckets(prometheus.ExponentialBuckets(0.001, 2, 15)),
Member Author

Before:

grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.001"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.01"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.05"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.1"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.2"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.4"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.8"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="1.6"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="3.2"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="6.4"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="+Inf"} 0

After:

grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.001"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.002"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.004"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.008"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.016"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.032"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.064"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.128"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.256"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.512"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="1.024"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="2.048"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="4.096"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="8.192"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="16.384"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="+Inf"} 0

Member Author

An example distribution for the existing buckets, from a real-life system.

sum(grpc_server_handling_seconds_bucket{job=~"thanos-store.*", grpc_type="server_stream"}) by (le)
{le="0.001"} | 0
{le="0.01"} | 0
{le="0.05"} | 2
{le="0.1"} | 5
{le="0.2"} | 13
{le="0.4"} | 34
{le="0.8"} | 62
{le="1.6"} | 103
{le="3.2"} | 133
{le="6.4"} | 158
{le="+Inf"} | 187
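
To show what these counts mean for latency graphs, here is a rough sketch of quantile estimation over cumulative buckets. It is modelled on how `histogram_quantile` is documented to behave (linear interpolation inside the matching bucket, and the highest finite bound when the rank lands in `+Inf`); `estimateQuantile` and `observed` are illustrative names, and the counts are copied from the query result above.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: count observations were <= le.
type bucket struct {
	le    float64
	count float64
}

// observed holds the cumulative counts from the query result, sorted by le.
var observed = []bucket{
	{0.001, 0}, {0.01, 0}, {0.05, 2}, {0.1, 5}, {0.2, 13}, {0.4, 34},
	{0.8, 62}, {1.6, 103}, {3.2, 133}, {6.4, 158}, {math.Inf(1), 187},
}

// estimateQuantile approximates quantile q (0 < q <= 1) from cumulative
// buckets sorted by le, with the +Inf bucket last. It interpolates linearly
// inside the bucket that contains the target rank. When the rank falls into
// the +Inf bucket it returns the highest finite bound, because nothing more
// is known about those observations.
func estimateQuantile(q float64, buckets []bucket) float64 {
	rank := q * buckets[len(buckets)-1].count
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if math.IsInf(b.le, 1) {
			if i == 0 {
				return math.NaN()
			}
			return buckets[i-1].le
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].le, buckets[i-1].count
		}
		return lower + (b.le-lower)*(rank-below)/(b.count-below)
	}
	return math.NaN()
}

func main() {
	fmt.Printf("p50=%.3f p99=%.3f\n",
		estimateQuantile(0.5, observed), estimateQuantile(0.99, observed))
}
```

With this data the P50 estimate lands at roughly 1.41s inside the (0.8, 1.6] bucket, while the P99 rank falls into the `+Inf` bucket and is therefore reported as 6.4, the highest finite bound.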

},
Name: "gate_duration_seconds",
Help: "How many seconds it took for queries to wait at the gate.",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 15),
Member Author

An example distribution for the existing buckets, from a real-life system.

sum(thanos_bucket_store_series_gate_duration_seconds_bucket{job="thanos-store"}) by (le)
{le="0.01"} | 0
{le="0.05"} | 0
{le="0.1"} | 0
{le="0.25"} | 0
{le="0.6"} | 0
{le="1"} | 0
{le="2"} | 0
{le="3.5"} | 0
{le="5"} | 0
{le="10"} | 0
{le="+Inf"} | 187
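
A quick way to spot a layout like this is to check how many observations land above the highest finite bound. The helper below is only an illustrative sketch (`overflowShare` is a made-up name), using the two relevant counts from the query result above.

```go
package main

import "fmt"

// overflowShare returns the fraction of observations that exceeded the
// highest finite bucket bound, given the cumulative count of that bucket and
// the cumulative count of the le="+Inf" bucket.
func overflowShare(highestFinite, total float64) float64 {
	if total == 0 {
		return 0
	}
	return (total - highestFinite) / total
}

func main() {
	// Values from the query result above: le="10" is 0, le="+Inf" is 187.
	share := overflowShare(0, 187)
	fmt.Printf("%.0f%% of observations are above the highest finite bound\n", 100*share)
}
```

When the share is this high, no finite bucket separates the observations, so quantile estimates from the histogram carry no information.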

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
@kakkoyun kakkoyun marked this pull request as ready for review November 13, 2019 19:08
@brancz
Member

brancz commented Nov 14, 2019

I’m expecting that we will need even higher buckets, but this is better than what we have and will clarify the need for more, so lgtm.

@brancz brancz merged commit a3ab545 into thanos-io:master Nov 14, 2019
@bwplotka
Member

I think the higher buckets have to depend on the query timeout, so we probably need higher ones, but do we need so many low-level buckets?

Do we really care whether a request takes 0.001 (seconds!) or 0.128 seconds? :thinking_face:

@kakkoyun kakkoyun deleted the histogram_buckets branch November 14, 2019 09:58
@kakkoyun
Member Author

I'm happy to revisit all of these issues once we know more about the distribution. What we have now does not tell us much, so I can do another iteration to tune the buckets.

tianyuansun pushed a commit to tianyuansun/thanos that referenced this pull request Nov 19, 2019
* Use exponential buckets for compactor histogram metrics

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Update buckets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Adjust histogram buckets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Adjust store gate bucket

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Adjust http duration buckets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: suntianyuan <suntianyuan@baidu.com>
@kakkoyun kakkoyun mentioned this pull request Nov 20, 2019
2 tasks
IKSIN pushed a commit to monitoring-tools/thanos that referenced this pull request Nov 26, 2019
* Use exponential buckets for compactor histogram metrics

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Update buckets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Adjust histogram buckets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Adjust store gate bucket

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Adjust http duration buckets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Aleksey Sin <asin@ozon.ru>