
Memory leak when using more than one MetricReader #4115

Closed
skadefro opened this issue Sep 4, 2023 · 4 comments · Fixed by #4163
Assignees: pichlermarc
Labels: bug (Something isn't working), has:reproducer (This bug/feature has a minimal reproducer provided), priority:p1 (Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc.)

Comments


skadefro commented Sep 4, 2023

What happened?

Steps to Reproduce

Add two MetricReaders to a MeterProvider via addMetricReader and watch memory usage explode.

Expected Result

A reasonable amount of memory usage, preferably the same as with only one MetricReader.

Actual Result

Very fast growth of RAM usage.

Additional Details

Using @opentelemetry/api 1.4.1 I began seeing massive spikes in memory usage, but only on some systems. After some digging around, I found it only happened on systems with more than one OpenTelemetry collector.
That had not been an issue before, so I downgraded to

"@opentelemetry/api-metrics": "^0.33.0",
"@opentelemetry/exporter-metrics-otlp-grpc": "^0.34.0",
"@opentelemetry/host-metrics": "^0.33.1",
"@opentelemetry/sdk-metrics-base": "^0.31.0",
"@opentelemetry/sdk-node": "^0.34.0",

That cut the memory usage in half, but it is still leaking memory.
The graph below shows memory usage for three runs:
First run: newest package versions
Second run: older package versions
Last run: only one MetricReader
[image: memory usage graph for the three runs]

While gathering data and graphs for this issue, I had the idea of removing AsyncHooksContextManager, since I'm not really using it right now.
I tried adding two MetricReaders and updating to the latest versions of the packages, but without AsyncHooksContextManager. That reduces the memory usage drastically, but memory usage still slowly creeps up over time, so it did not solve the issue. (Here compared with the graphs above.)
[image: memory usage graph without AsyncHooksContextManager]

The project this code is running in is super basic: it starts one or two new processes and restarts them if they die, so it should not really be using any RAM or CPU.

OpenTelemetry Setup Code

import { api, node, tracing } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Context, context, ROOT_CONTEXT, SpanContext, TraceFlags, trace } from '@opentelemetry/api';
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { HostMetrics } from '@opentelemetry/host-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { AsyncHooksContextManager } from '@opentelemetry/context-async-hooks'

// static functions on instrumentation class
    public static addMeterURL(url: string) {
        const statMetricReader = new PeriodicExportingMetricReader({
            exporter: new OTLPMetricExporter({
                url: url
            }),
            exportIntervalMillis: 3000,
        });
        instrumentation.meterProvider.addMetricReader(statMetricReader);
    }
    public static addTraceURL(url: string) {
        const stateTraceExporter = new OTLPTraceExporter({ url });
        instrumentation.traceProvider.addSpanProcessor(new tracing.BatchSpanProcessor(stateTraceExporter, { maxQueueSize: instrumentation.maxQueueSize }))
    }


// init in instrumentation class
        const contextManager = new AsyncHooksContextManager().enable();
        context.setGlobalContextManager(contextManager)

        instrumentation.traceProvider = new node.NodeTracerProvider({
            resource: instrumentation.resource
        })
        api.trace.setGlobalTracerProvider(instrumentation.traceProvider)
        instrumentation.meterProvider = new MeterProvider({
            resource: instrumentation.resource
        });
        api.metrics.setGlobalMeterProvider(instrumentation.meterProvider);
        if (process.env.enable_analytics != "false") {
            instrumentation.addMeterURL(instrumentation.baseurl)
            instrumentation.addTraceURL(instrumentation.baseurl)
        }
        if (process.env.otel_metric_url != null && process.env.otel_metric_url != "") {
            instrumentation.addMeterURL(process.env.otel_metric_url);
        }
        if (process.env.otel_trace_url != null && process.env.otel_trace_url != "") {
            instrumentation.addTraceURL(process.env.otel_trace_url);
        }
        const hostMetrics = new HostMetrics({
            meterProvider: instrumentation.meterProvider,
            name: "nodeagent"
        });
        if (process.env.otel_log_level != null && process.env.otel_log_level != "") {
            console.log("Starting HostMetrics");
        }
        hostMetrics.start();
        instrumentation.creatememorymeters()
        process.on('SIGINT', async () => {
            try {
                if (instrumentation.traceProvider != null) {
                    await instrumentation.traceProvider.shutdown();
                    instrumentation.traceProvider = null;
                }
                if (instrumentation.meterProvider != null) {
                    await instrumentation.meterProvider.shutdown();
                    instrumentation.meterProvider = null;
                }
            } catch (error) {
                console.error(error);
            } finally {
            }
        });

package.json

"@opentelemetry/api": "^1.4.1",
"@opentelemetry/auto-instrumentations-node": "^0.39.2",
"@opentelemetry/exporter-metrics-otlp-grpc": "^0.41.2",
"@opentelemetry/host-metrics": "^0.33.1",

Relevant log output

No response

skadefro added the bug (Something isn't working) and triage labels on Sep 4, 2023

skadefro commented Sep 4, 2023

I could not for the life of me remember why I never had this issue before, so I had a look at some old code I wrote a few years back that also implemented OTel. Back then I wrote my own MultiSpanExporter and MultiMetricExporter classes, so I was not registering multiple exporters; I simply handle exporting through multiple exporters in a wrapper class.
I guess I will just copy those classes over and do that again, since I see others have reported this issue and it does not look like something that is going to be fixed any time soon.
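
A rough sketch of that wrapper idea for metrics (not my original class, just an illustration assuming the PushMetricExporter interface from @opentelemetry/sdk-metrics and ExportResult/ExportResultCode from @opentelemetry/core):

import { ExportResult, ExportResultCode } from '@opentelemetry/core';
import { PushMetricExporter, ResourceMetrics } from '@opentelemetry/sdk-metrics';

// Fan-out exporter: register a single PeriodicExportingMetricReader with this
// exporter and it forwards every batch to all wrapped exporters, so only one
// MetricReader is ever attached to the MeterProvider.
export class MultiMetricExporter implements PushMetricExporter {
    constructor(private readonly exporters: PushMetricExporter[]) { }

    export(metrics: ResourceMetrics, resultCallback: (result: ExportResult) => void): void {
        let pending = this.exporters.length;
        let failed = false;
        if (pending === 0) {
            resultCallback({ code: ExportResultCode.SUCCESS });
            return;
        }
        for (const exporter of this.exporters) {
            exporter.export(metrics, (result) => {
                if (result.code !== ExportResultCode.SUCCESS) failed = true;
                if (--pending === 0) {
                    resultCallback({ code: failed ? ExportResultCode.FAILED : ExportResultCode.SUCCESS });
                }
            });
        }
    }

    forceFlush(): Promise<void> {
        return Promise.all(this.exporters.map((e) => e.forceFlush())).then(() => undefined);
    }

    shutdown(): Promise<void> {
        return Promise.all(this.exporters.map((e) => e.shutdown())).then(() => undefined);
    }
}

With something like that in place, addMeterURL would call addMetricReader only once, passing a MultiMetricExporter that wraps one OTLPMetricExporter per URL.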

pichlermarc added the priority:p1 (Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc.) label and removed the triage label on Sep 5, 2023

pichlermarc commented Sep 5, 2023

I created a minimal reproducer repository for it.
https://github.com/pichlermarc/repro-4115

It looks like the memory leak is in the metric collection/export pipeline; memory leaks quicker when the collection interval is shorter.
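
A sketch along those lines (not necessarily the exact contents of the repro repository): two PeriodicExportingMetricReaders with a short export interval on a single MeterProvider, plus an instrument that records continuously.

import {
    ConsoleMetricExporter,
    MeterProvider,
    PeriodicExportingMetricReader,
} from '@opentelemetry/sdk-metrics';

const meterProvider = new MeterProvider();

// Two readers on the same provider; a shorter interval makes memory grow faster.
for (let i = 0; i < 2; i++) {
    meterProvider.addMetricReader(
        new PeriodicExportingMetricReader({
            exporter: new ConsoleMetricExporter(),
            exportIntervalMillis: 100,
        })
    );
}

// Keep producing measurements so every collection cycle has new data.
const counter = meterProvider.getMeter('repro-4115').createCounter('test_counter');
setInterval(() => counter.add(1), 10);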

pichlermarc added the has:reproducer (This bug/feature has a minimal reproducer provided) label on Sep 5, 2023
pichlermarc self-assigned this on Sep 25, 2023
pichlermarc commented Sep 25, 2023

Self-assigning this because I think I've finally gotten to the bottom of this.

It looks like we're keeping track of unreported accumulations in TemporalMetricProcessor even if the storage is only intended for one collector (see here, where a storage is returned only for one collector). This causes the _unreportedAccumulations map to keep track of every change for n-1 collectors without ever resetting.
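
Roughly, the pattern is something like the following (a simplified illustration of the behaviour described above, not the actual TemporalMetricProcessor source):

// Simplified illustration only: not the real SDK code. Deltas get queued as
// "unreported" for every known collector on each collection, but only the
// collector that triggered the collection drains its queue, so with n readers
// the queues of the other n-1 collectors grow without bound.
type CollectorHandle = object;
type Accumulation = unknown;

class LeakyTemporalProcessor {
    private unreported = new Map<CollectorHandle, Accumulation[]>();

    collect(collector: CollectorHandle, allCollectors: CollectorHandle[], delta: Accumulation): Accumulation[] {
        // Record the new delta for every collector...
        for (const handle of allCollectors) {
            const queue = this.unreported.get(handle) ?? [];
            queue.push(delta);
            this.unreported.set(handle, queue);
        }
        // ...but only merge and reset the queue of the collector doing this collection.
        const merged = this.unreported.get(collector) ?? [];
        this.unreported.set(collector, []);
        return merged;
    }
}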

I've tried just collecting with one MetricCollectorHandle, which fixes the memory leaks, but that messes with Views and drops data from the first collection cycle for some readers. Fixing this will likely be non-trivial.

I'll aim to have a PR ready for this by the end of this week or early next week.


skadefro commented Sep 26, 2023

For anyone affected by this, here is a very crude workaround ...
