
Collector fails to restart with persistent queue and health check enabled #32456

Closed
ElfoLiNk opened this issue Apr 16, 2024 · 6 comments
ElfoLiNk commented Apr 16, 2024

Component(s)

extension/storage/filestorage

extension/healthcheck

What happened?

Description

The pod goes into CrashLoopBackOff when the collector is restarted with the persistent queue enabled.

Steps to Reproduce

Configure a file_storage extension and use it as the persistent queue storage for an exporter.

Expected Result

The collector should start normally after a restart.

Actual Result

The collector fails to come up; the pod is repeatedly killed due to liveness probe failures (HTTP 503).

Collector version

0.97.0

Environment information

Environment

AKS
OpenTelemetry Collector Helm chart, deployed as a StatefulSet

OpenTelemetry Collector configuration

extensions:
      file_storage/queue:
        compaction:
          directory: /tmp
          on_rebound: true
          on_start: true
        directory: /var/lib/storage/queue
        timeout: 1s
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
exporters:
      debug: {}
      otlp/elastic:
        endpoint: apm-server-apm-http:8200
        headers:
          Authorization: Bearer ${ELASTIC_APM_SECRET_TOKEN}
        sending_queue:
          queue_size: 50000
          storage: file_storage/queue
        tls:
          insecure_skip_verify: true

Log output

2024-04-16T17:23:06.087Z	info	service@v0.97.0/service.go:143	Starting otelcol-contrib...	{"Version": "0.97.0", "NumCPU": 8}
2024-04-16T17:23:06.087Z	info	extensions/extensions.go:34	Starting extensions...
2024-04-16T17:23:06.087Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "file_storage/queue"}
2024-04-16T17:23:06.087Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "file_storage/queue"}
2024-04-16T17:23:06.087Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "health_check"}
2024-04-16T17:23:06.087Z	info	healthcheckextension@v0.97.0/healthcheckextension.go:35	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"192.169.1.83:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-04-16T17:23:06.088Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "health_check"}
 Warning  Unhealthy  11m (x6 over 11m)      kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503

Additional context

Debug logs:

2024-04-16T17:52:47.351Z	info	service@v0.97.0/service.go:143	Starting otelcol-contrib...	{"Version": "0.97.0", "NumCPU": 16}
2024-04-16T17:52:47.351Z	info	extensions/extensions.go:34	Starting extensions...
2024-04-16T17:52:47.351Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "file_storage/queue"}
2024-04-16T17:52:47.351Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "file_storage/queue"}
2024-04-16T17:52:47.351Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "health_check"}
2024-04-16T17:52:47.351Z	info	healthcheckextension@v0.97.0/healthcheckextension.go:35	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"192.169.1.102:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-04-16T17:52:47.352Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "health_check"}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Channel created	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] original dial target is: "apm-server-apm-http:8200"	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] parsed dial target is: resolver.Target{URL:url.URL{Scheme:"apm-server-apm-http", Opaque:"8200", User:(*url.Userinfo)(nil), Host:"", Path:"", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] fallback to scheme "passthrough"	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] parsed dial target is: passthrough:///apm-server-apm-http:8200	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Channel authority set to "apm-server-apm-http:8200"	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "apm-server-apm-http:8200",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "apm-server-apm-http:8200",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)	{"grpc_log": true}
2024-04-16T17:52:47.352Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Channel switches to new LB policy "pick_first"	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [pick-first-lb 0xc002fd6030] Received new config {
  "shuffleAddressList": false
}, resolver state {
  "Addresses": [
    {
      "Addr": "apm-server-apm-http:8200",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "apm-server-apm-http:8200",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
}	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2] Subchannel created	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Channel Connectivity change to CONNECTING	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Channel exiting idle mode	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2] Subchannel picks a new address "apm-server-apm-http:8200" to connect	{"grpc_log": true}
2024-04-16T17:52:47.353Z	info	zapgrpc/zapgrpc.go:176	[core] [pick-first-lb 0xc002fd6030] Received SubConn state update: 0xc002fd61b0, {ConnectivityState:CONNECTING ConnectionError:<nil>}	{"grpc_log": true}
2024-04-16T17:52:47.373Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY	{"grpc_log": true}
2024-04-16T17:52:47.373Z	info	zapgrpc/zapgrpc.go:176	[core] [pick-first-lb 0xc002fd6030] Received SubConn state update: 0xc002fd61b0, {ConnectivityState:READY ConnectionError:<nil>}	{"grpc_log": true}
2024-04-16T17:52:47.373Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1] Channel Connectivity change to READY	{"grpc_log": true}
2024-04-16T17:52:47.786Z	debug	filestorage@v0.97.0/client.go:184	starting compaction	{"kind": "extension", "name": "file_storage/queue", "directory": "/var/lib/storage/queue/exporter_otlp_elastic_traces", "tempDirectory": "/tmp/tempdb3050547972"}
2024-04-16T17:52:47.786Z	debug	filestorage@v0.97.0/client.go:245	starting compaction loop	{"kind": "extension", "name": "file_storage/queue", "compaction_check_interval": 5}
2024-04-16T17:52:52.328Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 71}
2024-04-16T17:52:52.787Z	debug	filestorage@v0.97.0/client.go:282	shouldCompact check	{"kind": "extension", "name": "file_storage/queue", "totalSizeBytes": 2044944384, "dataSizeBytes": 1953927168}
2024-04-16T17:52:57.327Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 56}
2024-04-16T17:52:57.787Z	debug	filestorage@v0.97.0/client.go:282	shouldCompact check	{"kind": "extension", "name": "file_storage/queue", "totalSizeBytes": 2044944384, "dataSizeBytes": 1953927168}
2024-04-16T17:53:02.341Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 47}
2024-04-16T17:53:02.787Z	debug	filestorage@v0.97.0/client.go:282	shouldCompact check	{"kind": "extension", "name": "file_storage/queue", "totalSizeBytes": 2044944384, "dataSizeBytes": 1953927168}
2024-04-16T17:53:07.328Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 60}
2024-04-16T17:53:07.787Z	debug	filestorage@v0.97.0/client.go:282	shouldCompact check	{"kind": "extension", "name": "file_storage/queue", "totalSizeBytes": 2044944384, "dataSizeBytes": 1953927168}
2024-04-16T17:53:12.339Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 45}
2024-04-16T17:53:12.787Z	debug	filestorage@v0.97.0/client.go:282	shouldCompact check	{"kind": "extension", "name": "file_storage/queue", "totalSizeBytes": 2044944384, "dataSizeBytes": 1953927168}
@ElfoLiNk ElfoLiNk added bug Something isn't working needs triage New item requiring triage labels Apr 16, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski djaglowski added the extension/healthcheck Health Check Extension label Apr 16, 2024

Pinging code owners for extension/healthcheck: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

ElfoLiNk (Author)

I think the issue is related to compaction on_start being enabled; with it disabled, the collector passes the health check. It would be good to have a startup probe for this, instead of heavily changing the liveness check. Opened open-telemetry/opentelemetry-helm-charts#1152
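A sketch of the two workarounds discussed above, based on the config from the issue body. Disabling `on_start` avoids blocking startup on a large database; alternatively, a Kubernetes startupProbe gives the collector extra time before the liveness probe takes over. The probe values below are illustrative assumptions, not tested settings:

```yaml
# Workaround 1: skip compaction at startup (collector config)
extensions:
  file_storage/queue:
    directory: /var/lib/storage/queue
    timeout: 1s
    compaction:
      directory: /tmp
      on_rebound: true
      on_start: false   # avoid blocking startup on a large database

# Workaround 2: give startup more headroom (pod spec; values are illustrative)
# startupProbe:
#   httpGet:
#     path: /
#     port: 13133
#   failureThreshold: 30
#   periodSeconds: 10
```

With a startupProbe configured, the kubelet does not run the liveness probe until startup succeeds, so a long on-start compaction no longer triggers a restart loop.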

djaglowski added a commit that referenced this issue on May 16, 2024: …n temporary files (#32863)

**Description:**
This PR adds a new flag, **cleanup_on_start**, to the compaction section.
During compaction, a copy of the database is created; if the process is unexpectedly terminated, that temporary file is not removed. This can lead to disk exhaustion in the following scenario:
- The process is killed with a large database awaiting compaction
- Compaction is enabled on start
- Compaction takes longer than the time allotted for the collector to answer health checks (see: #32456)
- The process is killed while compacting
- A large temporary file is left behind

This mitigates the risk of such temporary files accumulating over a short period, whether through this scenario or similar ones.

**Testing:**
Includes a corner case where two instances of the extension are spawned and one is compacting while the other attempts cleanup.

**Documentation:**
Added a description of the new configuration flag to the README.

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
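Assuming the flag lands as described in the PR, the compaction section from the original config could additionally enable the cleanup. This is a sketch; see the file_storage extension README for the authoritative key name and default:

```yaml
extensions:
  file_storage/queue:
    directory: /var/lib/storage/queue
    timeout: 1s
    compaction:
      directory: /tmp
      on_start: true
      on_rebound: true
      cleanup_on_start: true  # new flag from #32863: remove leftover temp files before compacting
```

This does not shorten the on-start compaction itself, but it prevents repeated crash-loop restarts from leaving multiple multi-gigabyte temporary files behind in the compaction directory.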

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


@github-actions github-actions bot added the Stale label Aug 20, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 19, 2024