
2.0.0.M7 “spring-boot-starter-actuator” caused OOM #11338

Closed
chenjinkai opened this issue Dec 13, 2017 · 29 comments
Labels
type: bug A general bug
Comments

@chenjinkai

Maybe this jar is what caused the OOM problem:

org.latencyutils:LatencyUtils:2.0.3
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Dec 13, 2017
@chenjinkai
Author

The OOM shows up after my application has been running for about 24 hours.

@chenjinkai
Author

After dumping the memory, almost all objects are instances of org/LatencyUtils/LatencyStats$PauseTracker.

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Dec 13, 2017
@bclozel
Member

bclozel commented Dec 13, 2017

Could you report that issue to https://github.com/micrometer-metrics/micrometer ?

@philwebb
Member

Possibly related to #11333

@jkschneider
Contributor

> Possibly related to #11333

Only vaguely in that they are related to the same mechanism.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 13, 2017
@jkschneider
Contributor

jkschneider commented Dec 13, 2017

@chenjinkai Are you super positive that this is running on Boot 2 M7 and that you have a version of micrometer >= 1.0.0-rc.4 on your runtime classpath?

We had a memory leak in 1.0.0-rc.3 (Boot 2 M6) that was fixed in rc4. It's not impossible that there is still a leak, but I'm having a difficult time reproducing it.

@jkschneider
Contributor

One idea: perhaps you have a custom metric that is running up the cardinality on a particular tag? In effect, a separate Timer is created for every unique combination of name and tags. Since, by default, 5 histograms are kept in a ring buffer for every timer, with 1 LatencyStats and therefore 1 PauseDetector per histogram, in such a scenario it might appear that PauseDetector is the dominator. If you have 1/5 as many Timer instances as PauseDetector instances when the OOM occurs, we can be confident this is a tag cardinality problem and not a general leak in Timer.
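
A minimal sketch of that mechanism, assuming Micrometer's SimpleMeterRegistry (the metric name, tag key, and loop are illustrative, not taken from the reporter's app):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;

public class TagCardinalityDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Every distinct name + tags combination registers a brand-new Timer;
        // per the comment above, each Timer carries its own ring buffer of
        // histograms (and, in the rc-era implementation, LatencyStats and
        // PauseDetector instances), so an unbounded tag value set grows
        // memory without bound.
        for (int id = 0; id < 10_000; id++) {
            registry.timer("http.client.requests", "uri", "/test/" + id)
                    .record(5, TimeUnit.MILLISECONDS);
        }

        System.out.println("Meters registered: " + registry.getMeters().size());
    }
}

Comparing the Timer count against the PauseDetector count in the heap dump, as suggested above, is what distinguishes a tag cardinality problem from a genuine leak in Timer.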

@mweirauch
Contributor

Here's some plotting of a test app running with micrometer-core:1.0.0-rc.4 (~4 weeks) and rc.5 (~11 hours). No problems are observable. Note that this is with a steady number of meters.

[Plot: micrometer 1.0.0-rc.4 / Spring Boot 1.5.8, ~4 weeks]

[Plot: micrometer 1.0.0-rc.5 / Spring Boot 1.5.9, ~10 hours]

Now, as Jon said, if you produce Timers on a steady basis with changing tag values, actually making them distinct metrics on each construction, you can quickly get to OOMs. A test app with -Xmx64m and a Timer created every second with a distinct tag does lead to OOM within minutes.
Observed instance counts:

[Screenshots: observed LatencyUtils/PauseTracker instance counts over time]

I'd second @jkschneider regarding a tag cardinality "problem" on the user side.

@bclozel
Member

bclozel commented Dec 14, 2017

Thanks @mweirauch !

@chenjinkai, let us know if you can reproduce this consistently with a recent version and/or whether it's actually about creating multiple Timers for tag/name combinations.

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Dec 14, 2017
@spring-projects-issues
Collaborator

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@spring-projects-issues spring-projects-issues added the status: feedback-reminder We've sent a reminder that we need additional information before we can continue label Dec 21, 2017
@spring-projects-issues
Collaborator

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.

@spring-projects-issues spring-projects-issues removed status: waiting-for-feedback We need additional information before we can continue status: feedback-reminder We've sent a reminder that we need additional information before we can continue status: waiting-for-triage An issue we've not yet triaged labels Dec 28, 2017
@erlioniel

Looks like we encountered the same issue when migrating our application from M5 to M7. It reliably reproduces twice per day, so I can try to provide more data, but I'm not sure that we will be able to reproduce it in a more controlled environment.

Anyway, I think the issue itself should be reopened and solved somehow.

[Screenshot: leak suspect report from the heap dump]

@erlioniel

Some additional information: we have two services on M7 with the same load. The interesting thing is that the second service is OK, while the first one is steadily leaking.

The main difference is that the first service uses RestTemplate to call the second service.

@snicoll
Member

snicoll commented Jan 15, 2018

@erlioniel that's interesting. Could you please give it a try with 2.0.0.BUILD-SNAPSHOT? We've upgraded to micrometer 1.0.0-rc.7 in the meantime with specific changes around RestTemplate.

@jkschneider
Contributor

jkschneider commented Jan 15, 2018

@erlioniel Also, can you verify the cardinality of all meters you have in your app? A runaway tag value on a timer will cause an OOM eventually.

@erlioniel

@snicoll okay, sounds a bit risky, but we will try tomorrow.
@jkschneider we are not using any custom meters in our application and I don't see any suspicious objects in the service heap. But maybe you can describe what I can check to be sure?

[Screenshot: heap histogram showing LatencyStats instance counts]

@jkschneider
Contributor

jkschneider commented Jan 15, 2018

If you have 5,850 LatencyStats objects while accepting the defaults elsewhere, you have 1,170 distinct timers living in the app. Assuming this is somehow caused by RestTemplate, let's inspect the tags of that metric:

http://YOURSERVICE/actuator/metrics/http.client.requests

You'll have to specifically enable that actuator endpoint in Boot 2.

Just a guess, but I suspect you have a lot of tags because the app is pre-filling path variables:

restTemplate.getForObject("/test/{id}", String.class, 123); // good
restTemplate.getForObject("/test/123", String.class); // bad

@jkschneider
Contributor

jkschneider commented Jan 15, 2018

@snicoll We currently have a MeterFilter#maximumAllowableMetrics(int) concept for an application-wide max. But it would be trivial to add an additional MeterFilter#maximumAllowableMetrics(String, int) to limit tag explosion on a given metric name. And we could add some default cap in RestTemplateMetricsConfiguration to prevent instrumentation from causing memory problems when the RestTemplate interface is misused.
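
A hedged sketch of the existing application-wide cap mentioned here, using a plain SimpleMeterRegistry and an arbitrary limit of 1000 (in a Boot app the filter would be applied to the auto-configured registry instead); the per-metric-name variant was only a proposal at this point:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class MeterCapDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Once 1000 meters have been registered, further registrations are denied,
        // which bounds the memory a runaway tag can consume.
        registry.config().meterFilter(MeterFilter.maximumAllowableMetrics(1000));
    }
}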

@adriancole FYI. We might consider a similar option in Brave.

@snicoll
Member

snicoll commented Jan 16, 2018

> when the RestTemplate interface is misused.

I am not sure I agree with that statement. Perhaps we should make the instrumentation of RestTemplate completely optional and clarify in the doc what is required to avoid those side effects?

@erlioniel

erlioniel commented Jan 16, 2018

@jkschneider Yes, your assumption was correct. Looks like we are using RestTemplate the wrong way, and Spring creates a new tag value for every single request.

URI uri = fromUri(path)
        .queryParam("param", value)
        .build().encode().toUri();
return restTemplate.getForObject(uri, Entity.class);

Should I check the SNAPSHOT version, or is this the correct behavior of the framework? If it is, I think there should be a clearer way to avoid a problem like this, because it isn't easy to debug and fix.
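
For reference, a sketch of the templated form that keeps the uri tag bounded in a case like this, reusing the restTemplate, Entity, and value names from the snippet above (the path and parameter name are illustrative, and this assumes a single scalar query parameter rather than the array case mentioned in the next comment):

// The template string, not the expanded URI, is what the instrumentation
// records as the uri tag, so its cardinality stays constant.
Entity result = restTemplate.getForObject(
        "/entities?param={param}",   // illustrative path and parameter name
        Entity.class,
        value);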

@erlioniel

@jkschneider @snicoll
Btw, we are using UriComponentsBuilder because we have to pass an array as a GET param to the endpoint. So it might be useful to create an overload of getForObject() that takes a UriComponentsBuilder (instead of a URI), so instrumentation could still collect statistics in cases like ours.

@erlioniel

erlioniel commented Jan 16, 2018

I will leave my solution here for devs who are looking for how to disable or change tags for RestTemplate metrics, or who are fighting the same issue.

@Bean
public RestTemplateExchangeTagsProvider restTemplateTagConfigurer() {
    return new CustomRestTemplateExchangeTagsProvider();
}

// Assumes static imports of org.apache.commons.lang3.StringUtils.substringBefore
// and java.util.Arrays.asList.
private static class CustomRestTemplateExchangeTagsProvider implements RestTemplateExchangeTagsProvider {
    @Override
    public Iterable<Tag> getTags(String urlTemplate, HttpRequest request, ClientHttpResponse response) {
        // Strip the query string so every combination of query parameter values
        // maps to the same uri tag.
        String url = substringBefore(urlTemplate, "?");
        return asList(
                RestTemplateExchangeTags.method(request),
                RestTemplateExchangeTags.uri(url),
                RestTemplateExchangeTags.status(response),
                RestTemplateExchangeTags.clientName(request));
    }
}

P.S. Will not work with path variables

@wilkinsona
Member

@erlioniel So, in your case, it was the query parameters that were causing the problem rather than path variables?

@jkschneider
Contributor

> If it is, I think there should be a clearer way to avoid a problem like this, because it isn't easy to debug and fix.

Agreed. My plan is to add a MeterFilter limiting the tag cardinality of http.client.requests.
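
For anyone finding this later: a per-metric tag cap along these lines did land in Micrometer (see the issue linked a few comments below). A hedged sketch, assuming a MeterFilter.maximumAllowableTags(namePrefix, tagKey, max, onLimitReached) style API and an arbitrary limit of 100 distinct uri values; the exact signature may differ between Micrometer versions:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class HttpClientTagCapDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // After 100 distinct uri values have been seen for http.client.requests,
        // meters carrying new uri values are handled by the fallback filter
        // (here they are simply denied) instead of accumulating.
        registry.config().meterFilter(
                MeterFilter.maximumAllowableTags("http.client.requests", "uri", 100, MeterFilter.deny()));
    }
}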

@erlioniel

@wilkinsona yes

@erlioniel

@jkschneider Nice plan, btw. Is there an issue for that? Just to follow along.

@erlioniel

Also I will leave this link here :) spring-cloud/spring-cloud-netflix@5a127d0

@jkschneider
Contributor

@erlioniel Here is the associated Micrometer issue: micrometer-metrics/micrometer#363

@philwebb
Member

philwebb commented Jan 27, 2018

We need to port 1a9c268, but we first need at least some of #11800.
