[Upgrade Watcher][Crash Checker] Consider Agent process as crashed if its PID remains 0 #3166
Conversation
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
By adding a bunch of logging statements in the rollback code path, I've discovered what is potentially another bug or an area for improvement.
It looks like the Upgrade Watcher does its job correctly in deciding to roll back to the old Agent. At that point it correctly switches the symlink and updates the active commit in the agent commit file. Then it tries to restart the Agent by connecting to the Agent GRPC server and issuing a restart command, but this restart fails. What's worse is that because the restart process fails, the Upgrade Watcher never gets to the step of cleaning up the new (post-upgrade) Agent's files and, more importantly, the Upgrade Marker file. Eventually, the service starts up the old (pre-upgrade) Agent, which sees that there's an Upgrade Marker file present, and starts the Upgrade Watcher. For reasons I haven't figured out yet, this seems to be the new Agent's Upgrade Watcher. This Upgrade Watcher process eventually (in about 10 minutes) succeeds, as the old Agent is healthy. Upon success, this Upgrade Watcher then proceeds to clean up files for any Agent installations other than its own... which means it ends up cleaning up the old (and currently running) Agent's files!
After thinking about this a bit, my instinct says that we should change the order of operations for the rollback process. Currently, the order of operations is:
I think we should change the order of operations to:
This way, if the final restart step fails (as is happening while testing this PR; see previous comment for details), at least all the installed files are in a consistent state. Additionally, the service manager may yet restart the correct Agent process, effectively completing the final restart step. @cmacknz @michalpristas WDYT?
Discussed the proposed solution in #3166 (comment) in today's team meeting. Unfortunately the solution is not as straightforward as proposed, mainly because Windows does not allow cleaning up a running process's files (which could happen if the new Agent was running but unhealthy).

One idea that was discussed was to move the cleanup of no-longer-needed Agent files to the Agent itself. But this has some edge cases to consider, e.g. when we need the new Agent to keep the older Agent's files around for a potential rollback. Some potential solutions to such cases were also discussed, e.g. the Upgrade Watcher storing enough information about the desired outcome in the upgrade marker file and not deleting it, but this could be problematic for already-released versions of Agent, which may not expect the upgrade marker file to be present under the happy path.

After much discussion, the thinking is that we first need to get a handle on the various interactions between the Upgrade Watcher, the upgrade marker file, and the Agent, so we can understand the various failure scenarios as well as the impact of any potential changes to the rollback process.
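For illustration, here is a minimal sketch of the "store the desired outcome in the upgrade marker instead of deleting it" idea discussed above, assuming a YAML-encoded marker. All type and field names are hypothetical and do not reflect the actual elastic-agent marker structure.

```go
// Hypothetical sketch only: the real upgrade marker type in elastic-agent
// differs in both name and shape.
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

type upgradeMarker struct {
	Hash     string `yaml:"hash"`      // commit of the Agent upgraded to
	PrevHash string `yaml:"prev_hash"` // commit of the Agent upgraded from

	// Hypothetical addition: the outcome the Upgrade Watcher decided on,
	// kept in the marker so the running Agent can clean up files itself.
	DesiredOutcome string `yaml:"desired_outcome,omitempty"` // e.g. "rollback" or "complete"
}

func main() {
	m := upgradeMarker{Hash: "newcommit", PrevHash: "oldcommit", DesiredOutcome: "rollback"}
	out, err := yaml.Marshal(m)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

As noted above, the catch is that already-released Agents do not expect the marker to stick around on the happy path, so this is only a sketch of the direction, not a drop-in change.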
This pull request is now in conflicts. Could you fix it? 🙏
I just rebased this PR on To recap, this PR deliberately (just for testing) produces an Agent binary that exits with an error after sleeping for 11 seconds. The test is to start with a good build of Agent, say from
But while the upgrade was running, and even after the watcher exited, I can see that the Agent service is installed. I'm on Ubuntu so the service manager is
@blakerouse Given that the service is still installed, shouldn't the
To be clear, the service is installed, as seen above. If you agree this is a bug, I can create an issue for it with steps to reproduce.
@ycombinator It definitely should not be saying it's uninstalled. Either I did something wrong or the service module we use, which is an external dependency, is doing something wrong. Might be better off to revert my install checker PR. Don't want that to be killing the watcher all the time; that would be bad.
Thanks @blakerouse for the quick check. This PR here has a lot of changes in it. Let me try to come up with a minimal set of steps to reproduce the issue. I'll file an issue with the steps to reproduce and then we can decide if it makes sense to resolve that by reverting the install checker PR or fixing forward. Either way, it'll be good to have minimal repro steps documented in case we decide to bring the install checker back in the future with tweaks.
PR to document the details of these interactions: #3189
This pull request is now in conflicts. Could you fix it? 🙏
Waiting on #3268 to fix the bug explained in #3166 (comment).
This pull request is now in conflicts. Could you fix it? 🙏
@@ -29,7 +29,7 @@ type serviceHandler interface {
 // CrashChecker checks agent for crash pattern in Elastic Agent lifecycle.
 type CrashChecker struct {
 	notifyChan chan error
-	q          *disctintQueue
+	q          *distinctQueue
Just fixing a typo.
… its PID remains 0 (#3166) * Refactoring: extract helper method * Add check for PID remaining 0 * Update + add tests * Fix typo * Add CHANGELOG fragment * Better error messages * Bump up Agent version + cause error on start * Better logging for debugging * More logging for debugging * Trying secondary restart via service manager * Add FIXME comments for testing-only changes * Fix compile errors * Update testing version * Implement restart for upstart service manager * Include service provider name in error * Implement restart for sysv and darwin * Implement Restart for Windows * Remove all Restart() implementations * Removing extraneous logging statements * Undo vestigial changes * Rename all canc -> cancel * Use assert instead of require * Remove testing changes * Use assert instead of require (cherry picked from commit 2ce32f8)
… its PID remains 0 (#3166) (#3386) * Refactoring: extract helper method * Add check for PID remaining 0 * Update + add tests * Fix typo * Add CHANGELOG fragment * Better error messages * Bump up Agent version + cause error on start * Better logging for debugging * More logging for debugging * Trying secondary restart via service manager * Add FIXME comments for testing-only changes * Fix compile errors * Update testing version * Implement restart for upstart service manager * Include service provider name in error * Implement restart for sysv and darwin * Implement Restart for Windows * Remove all Restart() implementations * Removing extraneous logging statements * Undo vestigial changes * Rename all canc -> cancel * Use assert instead of require * Remove testing changes * Use assert instead of require (cherry picked from commit 2ce32f8) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
What does this PR do?
This PR fixes a bug in the Upgrade Watcher's Crash Checker, where it was considering the Agent process healthy (not crashed) even though its PID remained 0 every time the Crash Checker retrieved the PID from the service.
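As an illustration of the check described here (not the actual CrashChecker code; the function and threshold names below are hypothetical), the idea is to record the PID reported by the service on each poll and flag a crash either when the PID churns too much or, with this PR, when it stays at 0 for the entire window:

```go
// Hypothetical, self-contained illustration of the crash-detection idea in
// this PR; names do not match the real elastic-agent types.
package main

import "fmt"

// evaluatePIDs returns an error when the recorded PIDs suggest a crash.
func evaluatePIDs(pids []int, crashThreshold int) error {
	if len(pids) == 0 {
		return nil
	}

	distinct := make(map[int]struct{}, len(pids))
	allZero := true
	for _, pid := range pids {
		distinct[pid] = struct{}{}
		if pid != 0 {
			allZero = false
		}
	}

	// Behavior added by this PR: a PID that is always 0 means the service
	// never reported a running process, so treat it as a crash.
	if allZero {
		return fmt.Errorf("service PID remained 0 across %d checks", len(pids))
	}

	// Pre-existing pattern: too many distinct PIDs within the check window
	// implies the process is crash-looping.
	if len(distinct) >= crashThreshold {
		return fmt.Errorf("service PID changed %d times within the check window", len(distinct))
	}
	return nil
}

func main() {
	fmt.Println(evaluatePIDs([]int{0, 0, 0, 0}, 3)) // crash: PID stuck at 0
	fmt.Println(evaluatePIDs([]int{1234, 1234}, 3)) // healthy: stable non-zero PID
}
```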
Why is it important?
To detect when Agent has crashed so the Upgrade Watcher can initiate a rollback.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in ./changelog/fragments using the changelog tool
- I have added an integration test or an E2E test

How to test this PR locally
Manually testing this PR is currently blocked on #3377.
Testing this PR locally is not trivial, but it's possible. It involves upgrading to an Agent build where the Agent binary deliberately crashes, and making sure that the Agent is then rolled back to the previous (pre-upgrade) version.
1. Install and run Elastic Agent >= 8.10.0. This is necessary because we need the pre-upgrade Agent to kick off the Upgrade Watcher using the post-upgrade Agent's binary, a change that was implemented in 8.10.0.
2. Check out this PR and change the `elastic-agent run` code path so that the Agent process exits with an error. The easiest change is probably to add a small sleep, say 5 seconds, followed by returning an error, right here: `elastic-agent/internal/pkg/agent/cmd/run.go`, lines 65 to 66 in 539a5f2 (a sketch of this kind of change follows these steps).
3. Also bump the Agent version to 8.12.0 so the upgrade is possible: `elastic-agent/version/version.go`, line 7 in 243c76b.
4. Build the Agent package with the changes. Since the Agent version has been bumped and the corresponding component binaries won't be available, make sure to set `AGENT_DROP_PATH` to nothing. Make sure NOT to use `SNAPSHOT=true`, otherwise the snapshot artifact downloader will kick in during the upgrade process.
5. Upgrade the running Agent to the built Agent.
6. Ensure that the Agent is upgrading.
7. While the upgrade is in progress, watch the Upgrade Watcher's log.
8. After a couple of minutes, check that the Agent was rolled back.
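A minimal sketch of the kind of testing-only change described in step 2, assuming you simply want the post-upgrade Agent to look crashed; this is a standalone illustration with a hypothetical function name, not the actual edit to `internal/pkg/agent/cmd/run.go`:

```go
// Standalone illustration only: in the real test you would make the body of
// the `elastic-agent run` command sleep briefly and then return an error.
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// runAgent stands in for the run command's body (hypothetical name).
func runAgent() error {
	// Sleep long enough for the service manager to consider the process started.
	time.Sleep(5 * time.Second)
	// Then fail, so the Upgrade Watcher sees a crashed post-upgrade Agent.
	return errors.New("deliberate failure to simulate a crashed Agent")
}

func main() {
	if err := runAgent(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```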
Related issues