Prometheus receiver misses some metrics #34727

Open
peachisai opened this issue Aug 19, 2024 · 14 comments
@peachisai

Component(s)

cmd/otelcontribcol

What happened?

Description

When I use the prometheus receiver to scrape metrics, I found that the Collector misses some of them, even though it can scrape other metrics that have a similar structure.

Steps to Reproduce

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nacos-monitoring"
          scrape_interval: 30s
          metrics_path: "/nacos/actuator/prometheus"
          static_configs:
            - targets: ['127.0.0.1:8848']
          relabel_configs:
            - source_labels: [ ]
              target_label: cluster
              replacement: nacos-cluster
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: node
              replacement: $$1

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
      exporters:
        - debug

Expected Result

Original data:

nacos_monitor{module="naming",name="serviceCount",} 0.0
nacos_monitor{module="naming",name="ipCount",} 0.0

Actual Result

I only get ipCount:

NumberDataPoints #7
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(ipCount)
     -> node: Str(43.139.166.178:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-19 02:48:13.416 +0000 UTC
Value: 0.000000

Collector version

v0.107.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@peachisai peachisai added bug Something isn't working needs triage New item requiring triage labels Aug 19, 2024
@crobert-1 crobert-1 added the receiver/prometheus Prometheus receiver label Aug 19, 2024
Contributor

Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole
Contributor

Do you see anything in the logs?

Can you enable debug logging, and let us know if there are any scrape failures, etc?

Can you share the full scrape response for that metric?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?
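
For reference, the Collector's own debug logging can be turned on through the service telemetry settings; a minimal sketch (to be merged into the existing service section of the config above, field names follow the standard Collector telemetry configuration):

service:
  telemetry:
    logs:
      level: debug   # surfaces scrape errors and other receiver diagnostics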

@peachisai
Author

peachisai commented Aug 20, 2024

Do you see anything in the logs?

Can you enable debug logging, and let us know if there are any scrape failures, etc?

can you share the full scrape response for that metric?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

Hi, Thank you for the reply.
I use

exporters:
  debug:
    verbosity: detailed

These are some parts of my log.
I didn't find any errors or failures, and I can't find the names of the missing metrics.

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> action: Str(end of minor GC)
     -> cause: Str(Allocation Failure)
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #37
Descriptor:
     -> Name: executor_pool_max_threads
     -> Description: The maximum allowed number of threads in the pool
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(applicationTaskExecutor)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 2147483647.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(taskScheduler)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 2147483647.000000
Metric #38
Descriptor:
     -> Name: nacos_naming_subscriber
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v2)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #39
Descriptor:
     -> Name: jvm_classes_loaded_classes
     -> Description: The number of classes that are currently loaded in the Java virtual machine
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 14983.000000
Metric #40
Descriptor:
     -> Name: tomcat_sessions_created_sessions_total
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #41
Descriptor:
     -> Name: tomcat_sessions_alive_max_seconds
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #42
Descriptor:
     -> Name: nacos_naming_publisher
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v2)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #43
Descriptor:
     -> Name: jvm_gc_memory_allocated_bytes_total
     -> Description: Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 31471073024.000000
Metric #44
Descriptor:
     -> Name: executor_completed_tasks_total
     -> Description: The approximate total number of tasks that have completed execution
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(applicationTaskExecutor)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(taskScheduler)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 181528.000000
Metric #45
Descriptor:
     -> Name: nacos_timer_seconds
     -> Description:
     -> Unit:
     -> DataType: Summary
SummaryDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(writeConfigRpcRt)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Count: 2
Sum: 0.114000
Metric #46
Descriptor:
     -> Name: jdbc_connections_min
     -> Description: Minimum number of idle connections in the pool.
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(dataSource)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: -1.000000
Metric #47
Descriptor:
     -> Name: http_server_requests_seconds_max
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/v2/core/cluster/node/list)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/actuator/prometheus)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.003789
NumberDataPoints #2
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/v1/console/namespaces)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SERVER_ERROR)
     -> status: Str(501)
     -> uri: Str(root)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #48
Descriptor:
     -> Name: jdbc_connections_max
     -> Description: Maximum number of active connections that can be allocated at the same time.
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(dataSource)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: -1.000000
Metric #49
Descriptor:
     -> Name: executor_queued_tasks
     -> Description: The approximate number of tasks that are queued for execution
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

@dashpole dashpole removed the needs triage New item requiring triage label Aug 21, 2024
@dashpole dashpole self-assigned this Aug 21, 2024
@peachisai
Author

@dashpole Hi, I see this issue has been assigned. If there is any detail I should provide, please ping me.

@dashpole
Contributor

Were you able to check this?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

@peachisai
Author

Were you able to check this?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

Hi, I did not find any errors. Did you mean configuring the receiver to get the scrape log? Sorry, I don't know how to do that, could you give me some advice?
This is my receiver config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nacos-monitoring"
          scrape_interval: 30s
          metrics_path: "/nacos/actuator/prometheus"
          static_configs:
            - targets: ['127.0.0.1:8848']
          relabel_configs:
            - source_labels: [ ]
              target_label: cluster
              replacement: nacos-cluster
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: node
              replacement: $$1

@dashpole
Contributor

You should get additional metrics named "up" and "scrape_series_added", plus a few other scrape_.* metrics. The scrape_.* metrics let you know whether any metrics were dropped or rejected by Prometheus.
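
If it helps to isolate them in the debug output, a filter processor placed before the debug exporter can keep only those metrics; a rough sketch, assuming the filter processor's OTTL metric conditions (the processor name and regex are illustrative):

processors:
  filter/scrape-health:
    metrics:
      metric:
        # drop every metric whose name is not "up" and does not start with "scrape_"
        - 'not(name == "up" or IsMatch(name, "^scrape_"))'

Comparing scrape_samples_scraped with scrape_samples_post_metric_relabeling then shows whether relabeling dropped any samples, and up should be 1 if the target was scraped successfully.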

@peachisai
Author


Hi, I filtered for the up and scrape_* metrics and still found nothing:

Descriptor:
     -> Name: up
     -> Description: The scraping was successful
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Metric #4
Descriptor:
     -> Name: scrape_series_added
     -> Description: The approximate number of new series in this scrape
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Descriptor:
     -> Name: scrape_samples_post_metric_relabeling
     -> Description: The number of samples remaining after metric relabeling was applied
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Metric #1
Descriptor:
     -> Name: scrape_duration_seconds
     -> Description: Duration of the scrape
     -> Unit: s
     -> DataType: Gauge
NumberDataPoints #0

@dashpole
Contributor

Right, you will need to look at the values of those metrics to see if any are being dropped, or if the target is down. Otherwise, if you can provide the full output of the prometheus endpoint (e.g. using curl), we can try to reproduce.

@peachisai
Author

peachisai commented Aug 27, 2024

Right, you will need to look at the values of those metrics to see if any are being dropped, or if the target is down. Otherwise, if you can provide the full output of the prometheus endpoint (e.g. using curl), we can try to reproduce.

I browsed the log in detail but still found nothing containing an error or a drop. May I send you an email with my remote peer endpoint?

@dashpole
Contributor

I browsed the log in detail but still found nothing containing an error or a drop. May I send you an email with my remote peer endpoint?

No, sorry. Please don't email me links. I also don't actually need your logs--I need the metrics scrape response.

@peachisai
Author

I browsed the log in detail but still found nothing containing an error or a drop. May I send you an email with my remote peer endpoint?

No, sorry. Please don't email me links. I also don't actually need your logs--I need the metrics scrape response.

Hi, I found nothing dropped or failing in the metrics scrape response, but the receiver overlooked certain segments:

nacos_monitor{module="naming",name="mysqlHealthCheck",} 0.0
nacos_monitor{module="naming",name="emptyPush",} 0.0
nacos_monitor{module="config",name="configCount",} 2.0
nacos_monitor_count{module="core",name="raft_read_from_leader",} 0.0
nacos_monitor_sum{module="core",name="raft_read_from_leader",} 0.0
nacos_monitor{module="naming",name="tcpHealthCheck",} 0.0
nacos_monitor{module="naming",name="serviceChangedEventQueueSize",} 0.0
nacos_monitor{module="core",name="longConnection",} 0.0
nacos_monitor{module="naming",name="totalPush",} 0.0
nacos_monitor{module="naming",name="serviceSubscribedEventQueueSize",} 0.0
nacos_monitor{module="naming",name="serviceCount",} 0.0
nacos_monitor{module="naming",name="httpHealthCheck",} 0.0
nacos_monitor{module="naming",name="maxPushCost",} -1.0
nacos_monitor{module="config",name="longPolling",} 0.0
nacos_monitor{module="naming",name="failedPush",} 0.0
nacos_monitor{module="naming",name="leaderStatus",} 0.0
nacos_monitor{module="config",name="publish",} 0.0
nacos_monitor{module="config",name="dumpTask",} 0.0
nacos_monitor_count{module="core",name="raft_read_index_failed",} 0.0
nacos_monitor_sum{module="core",name="raft_read_index_failed",} 0.0
nacos_monitor{module="config",name="notifyTask",} 0.0
nacos_monitor{module="config",name="fuzzySearch",} 0.0
nacos_monitor{module="naming",name="avgPushCost",} -1.0
nacos_monitor{module="config",name="getConfig",} 0.0
nacos_monitor{module="naming",name="totalPushCountForAvg",} 0.0
nacos_monitor{module="naming",name="subscriberCount",} 0.0
nacos_monitor{module="naming",name="ipCount",} 0.0
nacos_monitor{module="config",name="notifyClientTask",} 0.0
nacos_monitor{module="naming",name="totalPushCostForAvg",} 0.0
nacos_monitor{module="naming",name="pushPendingTaskCount",} 0.0
# HELP nacos_monitor_max  

Everything above nacos_monitor_sum{module="core",name="raft_read_index_failed",} 0.0 cannot be scraped.
The rest of the metrics below it can be scraped.

Here is the scrape log:

Descriptor:
     -> Name: disk_total_bytes
     -> Description: Total space for path
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> path: Str(D:\ideaprojects\github\nacos\.)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 296022437888.000000
Metric #69
Descriptor:
     -> Name: nacos_monitor
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(core)
     -> name: Str(raft_read_index_failed)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(notifyTask)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(fuzzySearch)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(avgPushCost)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: -1.000000
NumberDataPoints #4
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(getConfig)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #5
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(totalPushCountForAvg)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #6
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(subscriberCount)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #7
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(ipCount)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000

@peachisai
Author

I will try to debug the code.

Contributor

github-actions bot commented Nov 7, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 7, 2024