
Add HighCardinalityTagsDetector #3047

Merged

Conversation

jonatan-ivanov
Member

@jonatan-ivanov jonatan-ivanov commented Mar 1, 2022

Draft solution for detecting high cardinality tags:

@jonatan-ivanov jonatan-ivanov marked this pull request as draft March 1, 2022 22:23
@jjank
Contributor

jjank commented Apr 22, 2022

Just a thought/question I had while looking at your draft:

There is the MeterRegistry.Config#onMeterAdded callback. Could that also be an entrypoint for detecting high-cardinality tags? Some advantages might be:

  • resource-friendly
  • possibly more user-friendly too:
    • immediate feedback when a meter is added
    • easier to configure, no polling interval
    • no need to shut down anything

Another random thought: should we give users the ability to implement a custom strategy? The micrometer default could be logging (like in your draft), but some users may want to be stricter and e.g. throw a runtime exception?
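To make the idea concrete, here is a rough, self-contained sketch of what a listener-based detector with a pluggable strategy could look like. The class and method names are hypothetical; in Micrometer the hook-in point would be the `MeterRegistry.Config#onMeterAdded` callback mentioned above, e.g. `registry.config().onMeterAdded(meter -> detector.onMeterAdded(meter.getId().getName()))`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch, not Micrometer code: count meters per name as they
// are added and delegate to a pluggable strategy once a name crosses the
// threshold. The strategy could log a warning (the draft's behavior) or
// throw a runtime exception for stricter setups.
class ListenerBasedDetector {
    private final Map<String, Integer> countsByName = new ConcurrentHashMap<>();
    private final int threshold;
    private final Consumer<String> strategy;

    ListenerBasedDetector(int threshold, Consumer<String> strategy) {
        this.threshold = threshold;
        this.strategy = strategy;
    }

    // Would be wired to MeterRegistry.Config#onMeterAdded in a real integration.
    void onMeterAdded(String meterName) {
        int count = countsByName.merge(meterName, 1, Integer::sum);
        if (count > threshold) {
            strategy.accept(meterName);
        }
    }
}
```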

Thoughts @jonatan-ivanov ?

@jonatan-ivanov
Member Author

@wengertj Yes, I think it can be used for this purpose.
I considered using it, but the trade-off between polling and using the listener is performance vs. convenience. I agree that adding this as a listener would be simpler, more convenient, and would look much nicer. On the other hand, every time a new meter is created you would need to go through all of the meters created so far to do the check, and sometimes you can have lots of meters. The polling solution instead does the check on an interval (e.g. every 5m), but it runs even if you did not add a single meter (this can be fixed by memorizing the number of meters).

@jjank
Contributor

jjank commented Apr 27, 2022

Fair points 👍
I think it somewhat comes down to how opinionated micrometer wants to be about this, i.e. whether the detector will be enabled by default or not.

Related to that: I think that it would be nice if we had an easy to discover API to enable or disable the detector (in addition to documentation). Maybe something along the lines of:

Metrics.globalRegistry.withDefaultHighCardinalityTagsDetector()

EDIT: You already mentioned that yourself "We might want this to be configurable from the MeterRegistry" 🙈

If we stick to the polling solution, we might consider implementing AutoCloseable and shutting down the polling executor in close. This gives users a clear indication that there are resources that need to be closed properly.
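The AutoCloseable idea could look something like the following minimal sketch (names are illustrative and the check itself is elided; the point is only that the detector owns its scheduler and releases it in `close()`):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: an AutoCloseable detector that owns its polling executor and
// shuts it down in close(), so try-with-resources or a lifecycle hook can
// release the polling thread deterministically.
class HighCardinalityTagsDetectorSketch implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "high-cardinality-tags-detector");
                thread.setDaemon(true);
                return thread;
            });

    // 'check' would run the actual cardinality scan on each tick.
    void start(Runnable check, long intervalMillis) {
        scheduler.scheduleAtFixedRate(check, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    boolean isRunning() {
        return !scheduler.isShutdown();
    }

    @Override
    public void close() {
        scheduler.shutdown(); // stops the polling thread
    }
}
```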

@jonatan-ivanov
Member Author

👍🏼 for AutoCloseable.

@jonatan-ivanov
Member Author

I did some measurements with a Spring Boot based web service (resourceater) that creates meters with high cardinality tags.

Environment

Java 18, macOS (x86_64)

 ❯ java --version
openjdk 18 2022-03-22
OpenJDK Runtime Environment Temurin-18+36 (build 18+36)
OpenJDK 64-Bit Server VM Temurin-18+36 (build 18+36, mixed mode, sharing)
 ❯ sw_vers
ProductName:	macOS
ProductVersion:	12.4

Code

You can see how meters are created in MeterResource and how I start the app in the Makefile (-XX:MinHeapSize=1G -XX:MaxHeapSize=1G).

Measurements

The scenario was always the same for all of the measurements:

  • Start the app
  • "Warm it up" by creating a single meter
  • Triggering GC from VisualVM
  • Creating meters
  • Triggering GC from VisualVM

So in the heap graphs you see some baseline, then a drop (the first GC, to get the real baseline), then a slope (creating meters), then another drop (the second GC, to see the real utilization). The amount of heap used by the meters is basically the difference between the baselines after the two drops.

I did not capture a heap dump or check the exact numbers; I only wanted to know the approximate usage.

1 million Counters

As you can see, creating 1 million Counters occupied a little less than 300MiB of heap.
[figure: 1M-counters heap graph]

1 million Timers

As you can see, creating 1 million Timers occupied approximately 550MiB of heap.
[figure: 1M-timers heap graph]

500k Counters plus 500k Timers

As you can see, creating 500k Counters and 500k Timers occupied approximately 400MiB of heap.
[figure: 500K-counters-and-500K-timers heap graph]

2 million Timers

java.lang.OutOfMemoryError: Java heap space

1.5 million Timers

As you can see, creating 1.5 million Timers occupied approximately 800MiB of heap.

[figure: 1.5M-timers heap graph]

Based on these measurements, I think setting the default threshold to 1M for the HighCardinalityTagsDetector can be an OK start: it's high enough to indicate that something is wrong, but low enough that apps with some breathing room have a chance to survive.

@jonatan-ivanov jonatan-ivanov marked this pull request as ready for review May 21, 2022 22:25
@jonatan-ivanov
Member Author

I'm wondering whether this should be opt-in or opt-out. Right now it is opt-in.

@jonatan-ivanov jonatan-ivanov requested a review from shakuzen May 21, 2022 22:27
@jonatan-ivanov jonatan-ivanov added enhancement A general enhancement module: micrometer-core An issue that is related to our core module labels May 21, 2022
@shakuzen
Member

  • We might want to do some measurements to set a default threshold

Since long ago, I've wanted to have some warning if memory used by meters exceeds a configurable % of heap memory. We could do something like that here, but maybe it doesn't make as much sense here and we should consider that separately. It just seems like we would want to warn about high cardinality before an OOM happens. The problem with an absolute threshold is that for low memory applications it will be too high. For large, monolithic applications with few instances, maybe it will be too low. It is configurable, so it probably isn't a problem. I'm just thinking how we can have a better default.

@jonatan-ivanov
Member Author

jonatan-ivanov commented May 25, 2022

Since long ago, I've wanted to have some warning if memory used by meters exceeds a configurable % of heap memory. We could do something like that here, but maybe it doesn't make as much sense here and we should consider that separately. It just seems like we would want to warn about high cardinality before an OOM happens.

I think we can prepare this for that use-case too, e.g. renaming it to something more general (AnomalyDetector) and making it possible to register event handlers so that you can get a notification once something bad happens. Though we need to provide a way to inject configuration for the code that does the checks (or make the code itself injectable).

The problem with an absolute threshold is that for low memory applications it will be too high. For large, monolithic applications with few instances, maybe it will be too low. It is configurable, so it probably isn't a problem. I'm just thinking how we can have a better default.

I'm going to check if I can figure out some usable heuristics and make the default a function of the max heap.

@jonatan-ivanov
Member Author

I added a simple heuristic for the threshold based on the max heap.
I also set a maximum value, which means that no matter how much heap you have, if two million meters share the same name, this thing will scream by default. If users decide that 2M+ is OK and they have enough heap, they can configure a value they like.

@jonatan-ivanov jonatan-ivanov added this to the 1.10.0-M3 milestone Jun 15, 2022
Contributor

@marcingrzejszczak marcingrzejszczak left a comment


LGTM

Member

@shakuzen shakuzen left a comment


Overall looks good. Just one comment about the default threshold calculation based on the heap.

@shakuzen shakuzen merged commit 3697478 into micrometer-metrics:main Jun 22, 2022
@jonatan-ivanov jonatan-ivanov deleted the high-cardinality-tags-detector branch June 22, 2022 18:19