Description
This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).
For confirmed bugs, please report:
- Version: main/8.10.0-SNAPSHOT
- Operating System: Linux
- Steps to Reproduce:
- To reproduce this bug, we need to upgrade to an Agent whose binary crashes upon start. The integration test in this PR builds such a failing, fake Agent binary, and packages it up.
- Try to upgrade to the Agent on Linux using `elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz`.
- While the upgrade is in progress, monitor the status of the Elastic Agent service:
    $ watch -n1 systemctl status elastic-agent.service
    Every 1.0s: systemctl status elastic-agent.service    tdwijr: Fri Jul 14 18:51:38 2023

    ● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
         Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
         Active: deactivating (stop-sigterm) (Result: exit-code) since Fri 2023-07-14 18:50:50 UTC; 48s ago
        Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
       Main PID: 97938 (code=exited, status=101)
          Tasks: 6 (limit: 4637)
         Memory: 105.8M
            CPU: 3.386s
         CGroup: /system.slice/elastic-agent.service
                 └─98040 /opt/Elastic/Agent/data/elastic-agent-903287/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent
- Note the `Main PID` in the `systemctl status elastic-agent.service` command's output. In the above example, it is `97938`.
- Now look at the Crash Checker's logs in the Upgrade Watcher's log and note the PID mentioned in there:
    $ grep -i 'crash.*service pid' /opt/Elastic/Agent/data/elastic-agent-903287/logs/elastic-agent-watcher-20230714-1.ndjson
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:00.213Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:10.214Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:20.215Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:30.216Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:40.217Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:50.218Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:00.219Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:10.220Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:20.222Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
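For anyone who wants to check the PID in these log entries programmatically rather than by eye, here is a minimal Go sketch. The helper name `extractServicePID` is hypothetical and not part of the Agent codebase; the only things taken from this report are the `message` field and the `retrieved service PID [N]` wording seen in the log lines above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// pidPattern matches the Crash Checker debug message seen in the watcher log.
var pidPattern = regexp.MustCompile(`retrieved service PID \[(\d+)\]`)

// extractServicePID is a hypothetical helper (not from the Agent codebase).
// It decodes one ndjson log line and returns the PID reported in "message".
func extractServicePID(line string) (int, error) {
	var entry struct {
		Message string `json:"message"`
	}
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return 0, fmt.Errorf("decoding log line: %w", err)
	}
	m := pidPattern.FindStringSubmatch(entry.Message)
	if m == nil {
		return 0, fmt.Errorf("no service PID in message %q", entry.Message)
	}
	return strconv.Atoi(m[1])
}

func main() {
	// First log line from the grep output in this report.
	line := `{"log.level":"debug","@timestamp":"2023-07-14T18:51:00.213Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}`
	pid, err := extractServicePID(line)
	if err != nil {
		panic(err)
	}
	// 97938 is the Main PID from the systemctl output in this report.
	fmt.Printf("crash checker PID: %d, systemd Main PID: %d\n", pid, 97938)
}
```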
The PID in the Crash Checker logs is `0`, which is not the PID of the Agent process (which should be the value of `Main PID` in the `systemctl status elastic-agent.service` command's output). The PID detected by the Crash Checker needs to be the Agent's PID since the Crash Checker detects that the Agent process has crashed based on how many times its PID changes within a certain time interval.
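To illustrate why a constant PID of `0` defeats this kind of check, here is a rough Go sketch of PID-change counting over a window of observations. It is not the actual `crash_checker.go` logic. The function names, the threshold of 2, and every PID in `restarting` after the first are assumptions made up for the example.

```go
package main

import "fmt"

// countPIDChanges returns how many times the PID differs between
// consecutive observations in the window.
func countPIDChanges(pids []int) int {
	changes := 0
	for i := 1; i < len(pids); i++ {
		if pids[i] != pids[i-1] {
			changes++
		}
	}
	return changes
}

// looksCrashed reports a crash when the PID changed more than maxChanges
// times within the window. maxChanges is a hypothetical threshold.
func looksCrashed(pids []int, maxChanges int) bool {
	return countPIDChanges(pids) > maxChanges
}

func main() {
	// Illustrative only: what a checker might observe if it tracked the real
	// Agent process while the service manager keeps restarting it.
	restarting := []int{97938, 98101, 98240, 98377, 98512, 98660}
	// What the watcher log in this report shows: the PID is always 0.
	stuckAtZero := []int{0, 0, 0, 0, 0, 0}

	fmt.Println(looksCrashed(restarting, 2))  // true
	fmt.Println(looksCrashed(stuckAtZero, 2)) // false
}
```

With the PID stuck at `0`, the change count in this sketch never grows, so a repeatedly crashing Agent would look healthy to a check of this shape.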