Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrumentation Double-Init / Caching Bug w/ Prometheus Exporter #3472

Closed
mothershipper opened this issue Dec 7, 2022 · 10 comments · Fixed by #3488
Closed

Instrumentation Double-Init / Caching Bug w/ Prometheus Exporter #3472

mothershipper opened this issue Dec 7, 2022 · 10 comments · Fixed by #3488
Assignees
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect signal:metrics Issues and PRs related to general metrics signal

Comments

@mothershipper
Copy link
Contributor

mothershipper commented Dec 7, 2022

What happened?

This is a bit hard for me to extract our exact setup, but roughly following the docs for the autoloader:

const provider = new MeterProvider();
provider.addMetricReader(new PrometheusExporter());
api.metrics.setGlobalMeterProvider(provider);
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});

When we run the following, prometheus will end up exporting the http metrics twice:

# HELP http_server_duration measures the duration of the inbound HTTP requests
# UNIT http_server_duration ms
# TYPE http_server_duration histogram
# HELP http_server_duration measures the duration of the inbound HTTP requests
# UNIT http_server_duration ms
# TYPE http_server_duration histogram
# HELP http_client_duration measures the duration of the outbound HTTP requests
# UNIT http_client_duration ms
# TYPE http_client_duration histogram
# HELP http_client_duration measures the duration of the outbound HTTP requests
# UNIT http_client_duration ms
# TYPE http_client_duration histogram

I added some debug code to the exporter:

   for (let metric of metrics.scopeMetrics) {
     console.log(`scope: ${metric.name}`)
      for (const data of metric.metrics) {
        console.log(data.descriptor)
    }

and I see the following logs:

scope: @opentelemetry/instrumentation-http
{
  name: 'http.server.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the inbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.server.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the inbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.client.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the outbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.client.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the outbound HTTP requests',
  unit: 'ms',
  valueType: 1
}

I think this is because the Instrumentation constructor inits the metrics first, then the call to registerInstrumentation passes in the same meterProvider to enableInstrumentations, triggering the instrumentation to re-create the metrics against the original meter/meterProvider.

Not sure what the "bug" is here, guessing it's either:

  1. Creating the metrics in the constructor, then re-initing with the same meter a little later in the autoloader
  2. Not caching/returning the cached metric when the meter hasn't changed
  3. Not de-duping these metrics in the prometheus exporter

OpenTelemetry Setup Code

No response

package.json

"@opentelemetry/api": "^1.3.0",
    "@opentelemetry/exporter-prometheus": "^0.34.0",
    "@opentelemetry/instrumentation-http": "^0.34.0",
    "@opentelemetry/sdk-metrics": "^1.8.0",

Relevant log output

No response

@mothershipper mothershipper added bug Something isn't working triage labels Dec 7, 2022
@mothershipper
Copy link
Contributor Author

Could also be user error / us looking at outdated docs -- we've since stopped calling registerInstrumentation, construction of the plugin seems to be sufficient to monkeypatch.

@dyladan dyladan added priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect and removed triage labels Dec 7, 2022
@dyladan
Copy link
Member

dyladan commented Dec 7, 2022

Assigned @hectorhdzg as he wrote the http metric instrumentation

@mothershipper
Copy link
Contributor Author

@dyladan I think this would affect anything that extends InstrumentationAbstract, not necessarily specific to the http instrumentation

@mothershipper
Copy link
Contributor Author

We just don't enable much else by default to have noticed/verified it affects others.

@hectorhdzg
Copy link
Member

Currently we call _updateMetricInstruments whenever an "Instrumentation" class is instantiated, this method will create the "Metric Instruments" through a Meter obtained using the current MeterProvider, so if the "Instrumentation" is enabled through config(enabled by default) when instantiated then it will be fully functional, registerInstrumentations will update the MeterProvider in this case using exact same one that was being used before and the "Instrumentation" will create the "Metrics Instruments" again, I believe the correct fix here would be to add extra logic in the Instrumentation class when setMeterProvider is called, to check if the Meter is different or not, and only create the "Metric Instruments" when is something different. @legendecas @dyladan thoughts?, I can implement the fix just getting some consensus on what the correct fix would be

@mothershipper
Copy link
Contributor Author

check if the Meter is different or not, and only create the "Metric Instruments" when is something different

This would get my vote, though it does mean the API would still allow end-users to double-create metric instruments on their own vs. caching and returning the same metric instrument from the meter.

@legendecas
Copy link
Member

legendecas commented Dec 13, 2022

Would you mind sharing the exact metrics output that you think contains duplicated metric records? I can not reproduce the problem with the following setup:

const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const api = require('@opentelemetry/api');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

api.diag.setLogger(new api.DiagConsoleLogger(), { logLevel: api.DiagLogLevel.ALL });

const provider = new MeterProvider();
provider.addMetricReader(new PrometheusExporter());
api.metrics.setGlobalMeterProvider(provider);

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});

const http = require('http');
http.createServer((req, res) => {
    res.end('hello');
}).listen(1234);
{
  "dependencies": {
    "@opentelemetry/api": "^1.3.0",
    "@opentelemetry/exporter-prometheus": "^0.34.0",
    "@opentelemetry/instrumentation": "^0.34.0",
    "@opentelemetry/instrumentation-http": "^0.34.0",
    "@opentelemetry/sdk-metrics": "^1.8.0"
  }
}

Although "same metric names" can exist in the output when you request http://localhost:9464/metrics, they are not "identical" metric records since their attributes (or labels in prometheus) are not the same.

Duplicated metric instrument registrations are not created multiple times. Instead, the returned metric instruments are sharing the same metric storages (https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/sdk-metrics/src/state/MeterSharedState.ts#L128). So I don't find this is a problem with _updateMetricInstruments.

@legendecas legendecas added the signal:metrics Issues and PRs related to general metrics signal label Dec 13, 2022
@mothershipper
Copy link
Contributor Author

@legendecas thank you for the base code, this should show the duplicate metrics/series:

const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const api = require('@opentelemetry/api');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

api.diag.setLogger(new api.DiagConsoleLogger(), { logLevel: api.DiagLogLevel.ALL });

const provider = new MeterProvider();


const exporter = new PrometheusExporter();
const old = exporter._exportMetrics
exporter._exportMetrics = (response) => {
  exporter.collect().then((collectionResult) => {
    const { resourceMetrics, errors } = collectionResult;
    for (const metric of resourceMetrics.scopeMetrics) {
      console.log(`scope: ${metric.scope.name}`);
      console.log(metric);
      for (const series of metric.metrics) {
        console.log(series.descriptor);
      }
    }
  }) 
  old(response);
}

provider.addMetricReader(exporter);
api.metrics.setGlobalMeterProvider(provider);

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});

const http = require('http');
http.createServer((req, res) => {
    res.end('hello');
}).listen(1234);

Running this I see:

scope: @opentelemetry/instrumentation-http
{
  scope: {
    name: '@opentelemetry/instrumentation-http',
    version: '0.34.0',
    schemaUrl: undefined
  },
  metrics: [
    {
      descriptor: [Object],
      aggregationTemporality: 1,
      dataPointType: 0,
      dataPoints: []
    },
    {
      descriptor: [Object],
      aggregationTemporality: 1,
      dataPointType: 0,
      dataPoints: []
    },
    {
      descriptor: [Object],
      aggregationTemporality: 1,
      dataPointType: 0,
      dataPoints: []
    },
    {
      descriptor: [Object],
      aggregationTemporality: 1,
      dataPointType: 0,
      dataPoints: []
    }
  ]
}
{
  name: 'http.server.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the inbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.server.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the inbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.client.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the outbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.client.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the outbound HTTP requests',
  unit: 'ms',
  valueType: 1
}

As far as I can tell, all those histograms are the same labels. Run the following (removes the call to registerInstruments):

const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const api = require('@opentelemetry/api');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

api.diag.setLogger(new api.DiagConsoleLogger(), { logLevel: api.DiagLogLevel.ALL });

const provider = new MeterProvider();


const exporter = new PrometheusExporter();
const old = exporter._exportMetrics
exporter._exportMetrics = (response) => {
  exporter.collect().then((collectionResult) => {
    const { resourceMetrics, errors } = collectionResult;
    for (const metric of resourceMetrics.scopeMetrics) {
      console.log(`scope: ${metric.scope.name}`);
      console.log(metric);
      for (const series of metric.metrics) {
        console.log(series.descriptor);
      }
    }
  }) 
  old(response);
}

provider.addMetricReader(exporter);
api.metrics.setGlobalMeterProvider(provider);

new HttpInstrumentation()

const http = require('http');
http.createServer((req, res) => {
    res.end('hello');
}).listen(1234);

and I see:

scope: @opentelemetry/instrumentation-http
{
  scope: {
    name: '@opentelemetry/instrumentation-http',
    version: '0.34.0',
    schemaUrl: undefined
  },
  metrics: [
    {
      descriptor: [Object],
      aggregationTemporality: 1,
      dataPointType: 0,
      dataPoints: [Array]
    },
    {
      descriptor: [Object],
      aggregationTemporality: 1,
      dataPointType: 0,
      dataPoints: []
    }
  ]
}
{
  name: 'http.server.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the inbound HTTP requests',
  unit: 'ms',
  valueType: 1
}
{
  name: 'http.client.duration',
  type: 'HISTOGRAM',
  description: 'measures the duration of the outbound HTTP requests',
  unit: 'ms',
  valueType: 1
}

@mothershipper
Copy link
Contributor Author

To be clear, the individual data points aren't duplicated, but in the case where the HTTP instrumentation produces 4 metrics under the scoped metrics, 2 of those metrics will never have any data points, because the instrument threw away the metrics and re-created them in _updateMetricInstruments

@mothershipper
Copy link
Contributor Author

Although "same metric names" can exist in the output when you request http://localhost:9464/metrics, they are not "identical" metric records since their attributes (or labels in prometheus) are not the same.

In the example below (working properly) each line has different labels, correct:

# HELP http_server_duration measures the duration of the inbound HTTP requests
# UNIT http_server_duration ms
# TYPE http_server_duration histogram
http_server_duration_count{http_scheme="http",http_method="GET",net_host_name="localhost",http_flavor="1.1",http_status_code="200",net_host_port="1234"} 2 1670974575998
http_server_duration_sum{http_scheme="http",http_method="GET",net_host_name="localhost",http_flavor="1.1",http_status_code="200",net_host_port="1234"} 9 1670974575998

In the case where we call registerInstruments, we get the # HELP, #UNIT and #TYPE lines twice, which is the duplicate I was referencing.

# HELP http_server_duration measures the duration of the inbound HTTP requests
# UNIT http_server_duration ms
# TYPE http_server_duration histogram
<!-- Prometheus reads this as an empty metric -->
# HELP http_server_duration measures the duration of the inbound HTTP requests
# UNIT http_server_duration ms
# TYPE http_server_duration histogram
http_server_duration_count{http_scheme="http",http_method="GET",net_host_name="localhost",http_flavor="1.1",http_status_code="200",net_host_port="1234"} 2 1670974575998
http_server_duration_sum{http_scheme="http",http_method="GET",net_host_name="localhost",http_flavor="1.1",http_status_code="200",net_host_port="1234"} 9 1670974575998

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect signal:metrics Issues and PRs related to general metrics signal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants