[Upgrade Watcher][Crash Checker] Consider Agent process as crashed if its PID remains 0 #3166
Conversation
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
By adding a bunch of logging statements in the rollback code path, I've discovered what is potentially another bug or an area for improvement.
It looks like the Upgrade Watcher does its job correctly in deciding to roll back to the old Agent. At that point it correctly switches the symlink and updates the active commit in the agent commit file. Then it tries to restart the Agent by connecting to the Agent GRPC server and issuing a restart command, but this restart fails. What's worse is that because the restart process fails, the Upgrade Watcher never gets to the step of cleaning up the new (post-upgrade) Agent's files and, more importantly, the Upgrade Marker file. Eventually, the service starts up the old (pre-upgrade) Agent, which sees that there's an Upgrade Marker file present, and starts the Upgrade Watcher. For reasons I haven't figured out yet, this seems to be the new Agent's Upgrade Watcher. This Upgrade Watcher process eventually (in about 10 minutes) succeeds, as the old Agent is healthy. Upon success, this Upgrade Watcher then proceeds to clean up files for any Agent installations other than its own... which means it ends up cleaning up the old (and currently running) Agent's files!
After thinking about this a bit, my instinct says that we should change the order of operations for the rollback process. Currently, the order of operations is:
I think we should change the order of operations to:
This way, if the final restart step fails (as is happening while testing this PR; see previous comment for details), at least all the installed files are in a consistent state. Additionally, the service manager may yet restart the correct Agent process, effectively completing the final restart step. @cmacknz @michalpristas WDYT?
Discussed the proposed solution in #3166 (comment) in today's team meeting. Unfortunately the solution is not as straightforward as proposed, mainly because Windows does not allow cleaning up a running process's files (which could happen if the new Agent was running but unhealthy).

One idea that was discussed was to move the cleanup of no-longer-needed Agent files to the Agent itself. But this has some edge cases to consider, e.g. when we need the new Agent to keep the older Agent's files around for a potential rollback. Some potential solutions to such cases were also discussed, e.g. the Upgrade Watcher storing enough information about the desired outcome in the upgrade marker file and not deleting it, but this could be problematic for already-released versions of Agent, which may not expect the upgrade marker file to be present under the happy path.

After much discussion, the thinking is that we first need to get a handle on the various interactions between the Upgrade Watcher, the upgrade marker file, and the Agent, so we can understand the various failure scenarios as well as the impact of any potential changes to the rollback process.
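For illustration, here is a minimal sketch of the "store the desired outcome in the upgrade marker instead of deleting it" idea discussed above, assuming a YAML-encoded marker. All type and field names are hypothetical and do not reflect the actual elastic-agent marker structure.

```go
// Hypothetical sketch only: the real upgrade marker type in elastic-agent
// differs in both name and shape.
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

type upgradeMarker struct {
	Hash     string `yaml:"hash"`      // commit of the Agent upgraded to
	PrevHash string `yaml:"prev_hash"` // commit of the Agent upgraded from

	// Hypothetical addition: the outcome the Upgrade Watcher decided on,
	// kept in the marker so the running Agent can clean up files itself.
	DesiredOutcome string `yaml:"desired_outcome,omitempty"` // e.g. "rollback" or "complete"
}

func main() {
	m := upgradeMarker{Hash: "newcommit", PrevHash: "oldcommit", DesiredOutcome: "rollback"}
	out, err := yaml.Marshal(m)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

As noted above, the catch is that already-released Agents do not expect the marker to stick around on the happy path, so this is only a sketch of the direction, not a drop-in change.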
This pull request is now in conflicts. Could you fix it? 🙏
I just rebased this PR on To recap, this PR deliberately (just for testing) produces an Agent binary that exits with an error after sleeping for 11 seconds. The test is to start with a good build of Agent, say from
But while the upgrade was running, and even after the watcher exited, I can see that the Agent service is installed. I'm on Ubuntu so the service manager is
@blakerouse Given that the service is still installed, shouldn't the
To be clear, the service is installed, as seen above. If you agree this is a bug, I can create an issue for it with steps to reproduce.
@ycombinator It definitely should not be saying it's uninstalled. Either I did something wrong or the service module we use, which is an external dependency, is doing something wrong. Might be better off to revert my install checker PR. Don't want that to be killing the watcher all the time; that would be bad.
Thanks @blakerouse for the quick check. This PR here has a lot of changes in it. Let me try to come up with a minimal set of steps to reproduce the issue. I'll file an issue with the steps to reproduce and then we can decide if it makes sense to resolve that by reverting the install checker PR or fixing forward. Either way, it'll be good to have minimal repro steps documented in case we decide to bring the install checker back in the future with tweaks.
PR to document the details of these interactions: #3189
This pull request is now in conflicts. Could you fix it? 🙏
Waiting on #3268 to fix the bug explained in #3166 (comment).
This pull request is now in conflicts. Could you fix it? 🙏
@@ -29,7 +29,7 @@ type serviceHandler interface {
 // CrashChecker checks agent for crash pattern in Elastic Agent lifecycle.
 type CrashChecker struct {
 	notifyChan chan error
-	q          *disctintQueue
+	q          *distinctQueue
Just fixing a typo.
… its PID remains 0 (#3166) * Refactoring: extract helper method * Add check for PID remaining 0 * Update + add tests * Fix typo * Add CHANGELOG fragment * Better error messages * Bump up Agent version + cause error on start * Better logging for debugging * More logging for debugging * Trying secondary restart via service manager * Add FIXME comments for testing-only changes * Fix compile errors * Update testing version * Implement restart for upstart service manager * Include service provider name in error * Implement restart for sysv and darwin * Implement Restart for Windows * Remove all Restart() implementations * Removing extraneous logging statements * Undo vestigial changes * Rename all canc -> cancel * Use assert instead of require * Remove testing changes * Use assert instead of require (cherry picked from commit 2ce32f8)
… its PID remains 0 (#3166) (#3386) * Refactoring: extract helper method * Add check for PID remaining 0 * Update + add tests * Fix typo * Add CHANGELOG fragment * Better error messages * Bump up Agent version + cause error on start * Better logging for debugging * More logging for debugging * Trying secondary restart via service manager * Add FIXME comments for testing-only changes * Fix compile errors * Update testing version * Implement restart for upstart service manager * Include service provider name in error * Implement restart for sysv and darwin * Implement Restart for Windows * Remove all Restart() implementations * Removing extraneous logging statements * Undo vestigial changes * Rename all canc -> cancel * Use assert instead of require * Remove testing changes * Use assert instead of require (cherry picked from commit 2ce32f8) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
What does this PR do?
This PR fixes a bug in the Upgrade Watcher's Crash Checker, where it was considering the Agent process healthy (not crashed) even though its PID remained 0 every time the Crash Checker retrieved the PID from the service.
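As an illustration of the check described here (not the actual CrashChecker code; the function and threshold names below are hypothetical), the idea is to record the PID reported by the service on each poll and flag a crash either when the PID churns too much or, with this PR, when it stays at 0 for the entire window:

```go
// Hypothetical, self-contained illustration of the crash-detection idea in
// this PR; names do not match the real elastic-agent types.
package main

import "fmt"

// evaluatePIDs returns an error when the recorded PIDs suggest a crash.
func evaluatePIDs(pids []int, crashThreshold int) error {
	if len(pids) == 0 {
		return nil
	}

	distinct := make(map[int]struct{}, len(pids))
	allZero := true
	for _, pid := range pids {
		distinct[pid] = struct{}{}
		if pid != 0 {
			allZero = false
		}
	}

	// Behavior added by this PR: a PID that is always 0 means the service
	// never reported a running process, so treat it as a crash.
	if allZero {
		return fmt.Errorf("service PID remained 0 across %d checks", len(pids))
	}

	// Pre-existing pattern: too many distinct PIDs within the check window
	// implies the process is crash-looping.
	if len(distinct) >= crashThreshold {
		return fmt.Errorf("service PID changed %d times within the check window", len(distinct))
	}
	return nil
}

func main() {
	fmt.Println(evaluatePIDs([]int{0, 0, 0, 0}, 3)) // crash: PID stuck at 0
	fmt.Println(evaluatePIDs([]int{1234, 1234}, 3)) // healthy: stable non-zero PID
}
```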
Why is it important?
To detect when Agent has crashed so the Upgrade Watcher can initiate a rollback.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in ./changelog/fragments using the changelog tool
- I have added an integration test or an E2E test

How to test this PR locally
Manually testing this PR is currently blocked on #3377.
Testing this PR locally is not trivial, but it's possible. It involves upgrading to an Agent build where the Agent binary deliberately crashes, and making sure that the Agent is then rolled back to the previous (pre-upgrade) version.
1. Install and run Elastic Agent >= 8.10.0. This is necessary because we need the pre-upgrade Agent to kick off the Upgrade Watcher using the post-upgrade Agent's binary, a change that was implemented in 8.10.0.
2. Check out this PR and change the `elastic-agent run` code path so that the Agent process exits with an error. The easiest change is probably to add a small sleep, say 5 seconds, followed by returning an error, right here: `elastic-agent/internal/pkg/agent/cmd/run.go`, lines 65 to 66 in 539a5f2 (a sketch of this kind of change follows these steps).
3. Also bump the Agent version to 8.12.0 so the upgrade is possible: `elastic-agent/version/version.go`, line 7 in 243c76b.
4. Build the Agent package with the changes. Since the Agent version has been bumped and the corresponding component binaries won't be available, make sure to set `AGENT_DROP_PATH` to nothing. Make sure NOT to use `SNAPSHOT=true`, otherwise the snapshot artifact downloader will kick in during the upgrade process.
5. Upgrade the running Agent to the built Agent.
6. Ensure that the Agent is upgrading.
7. While the upgrade is in progress, watch the Upgrade Watcher's log.
8. After a couple of minutes, check that the Agent was rolled back.
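A minimal sketch of the kind of testing-only change described in step 2, assuming you simply want the post-upgrade Agent to look crashed; this is a standalone illustration with a hypothetical function name, not the actual edit to `internal/pkg/agent/cmd/run.go`:

```go
// Standalone illustration only: in the real test you would make the body of
// the `elastic-agent run` command sleep briefly and then return an error.
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// runAgent stands in for the run command's body (hypothetical name).
func runAgent() error {
	// Sleep long enough for the service manager to consider the process started.
	time.Sleep(5 * time.Second)
	// Then fail, so the Upgrade Watcher sees a crashed post-upgrade Agent.
	return errors.New("deliberate failure to simulate a crashed Agent")
}

func main() {
	if err := runAgent(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```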
Related issues