Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd  #3124

Closed

Description

This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).

For confirmed bugs, please report:

  • Version: main / 8.10.0-SNAPSHOT
  • Operating System: Linux
  • Steps to Reproduce:
  1. To reproduce this bug, we need to upgrade to an Agent whose binary crashes upon start. The integration test in this PR builds such a failing, fake Agent binary, and packages it up.

  2. Upgrade to this fake Agent on Linux by running elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz.

  3. While the upgrade is in progress, monitor the status of the Elastic Agent service:

    $ watch -n1 systemctl status elastic-agent.service
    
    Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:51:38 2023
    
    ● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
         Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
         Active: deactivating (stop-sigterm) (Result: exit-code) since Fri 2023-07-14 18:50:50 UTC; 48s ago
        Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
       Main PID: 97938 (code=exited, status=101)
          Tasks: 6 (limit: 4637)
         Memory: 105.8M
            CPU: 3.386s
         CGroup: /system.slice/elastic-agent.service
                 └─98040 /opt/Elastic/Agent/data/elastic-agent-903287/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent
    
  4. Note the Main PID in the systemctl status elastic-agent.service command's output. In the above example, it is 97938.

  5. Now find the Crash Checker's entries in the Upgrade Watcher's log and note the PID they report:

    $ grep -i 'crash.*service pid' /opt/Elastic/Agent/data/elastic-agent-903287/logs/elastic-agent-watcher-20230714-1.ndjson
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:00.213Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:10.214Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:20.215Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:30.216Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:40.217Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:50.218Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:00.219Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:10.220Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:20.222Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    

    The PID in the Crash Checker logs is 0, which does not match the Agent's PID (the Main PID value in the systemctl status elastic-agent.service output, 97938 in this example). The Crash Checker must see the Agent's actual PID because it decides that the Agent has crashed by counting how many times that PID changes within a certain time interval.

Labels: Team:Elastic-Agent, bug