Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush overall Pipeline log stream when shutting down Jenkins #669

Merged
merged 5 commits into from
Mar 7, 2023

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Mar 3, 2023

Right now, when Jenkins shuts down gracefully, we attempt to pause Pipeline execution and save all related state information. However, we do not flush the listener for the overall log, which means that for LogStorage implementations like FileLogStorage that write the log and log metadata to separate files, and write metadata synchronously while buffering log writes, the metadata and log file may be out of sync after a Jenkins restart.

In particular, when performing a safe restart, we write "Pausing (Preparing for shutdown)" to the build log and pause the Pipeline. For FileLogStorage, if the last log line was written by a step, then we will immediately write the current offset in the log to the metadata file and then flush the metadata file. We will also write the pause message to the log file, but that write will be buffered, and may be lost. On restart, subsequent writes will begin at the log offset that we expected to start with "Pausing (Preparing for shutdown)". For FileLogStorage, this is fine, because the first lines after the restart are guaranteed to come from the overall log, but for more complex implementations of LogStorage that store additional metadata (e.g. line lengths), this desynchronization can cause problems.

TODO:

  • Can this be reproduced in a test?
    • I am not entirely sure why, but so far I have been unable to reproduce the issue in a test (e.g. JenkinsSessionRule + doQuietDown). I can only reproduce the issue while restarting an actual Jenkins instance using /safeRestart.
    • It can be reproduced using RealJenkinsRule, so I added a new test
  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@dwnusbaum dwnusbaum requested a review from a team as a code owner March 3, 2023 21:31
@@ -1694,6 +1694,7 @@ public void pause(final boolean v) throws IOException {
LOGGER.log(Level.WARNING, null, t);
}
});
cpsExec.getOwner().getListener().getLogger().close();
Copy link
Member Author

@dwnusbaum dwnusbaum Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also just flush (I tested both options), but I'm not really sure that it matters either way.

Suggested change
cpsExec.getOwner().getListener().getLogger().close();
cpsExec.getOwner().getListener().getLogger().flush();

@jglick jglick added the bug label Mar 3, 2023
@jglick
Copy link
Member

jglick commented Mar 3, 2023

If JenkinsSessionRule does not work, try RealJenkinsRule.

@@ -1694,6 +1694,7 @@ public void pause(final boolean v) throws IOException {
LOGGER.log(Level.WARNING, null, t);
}
});
cpsExec.getOwner().getListener().getLogger().close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is getOwner() guaranteed to be non-null? getListener()? In principle SpotBugs should know but sometimes it seems to miss these things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CpsFlowExecution.owner is guaranteed to be non-null. It's set here and here (called here).

getListener may return WorkflowRun.NULL_LISTENER, which is harmless to close, or throw an exception, but this code already handles exceptions, and I made this be the final operation in the try block so that other more important cleanup still runs even if this throws.

Copy link
Member Author

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If JenkinsSessionRule does not work, try RealJenkinsRule.

Good call, RealJenkinsRule is indeed able to reproduce the issue. I guess JVM reuse across the restart is why this can't be reproduced using JenkinsSessionRule or something? The static FileLogStorage.openStorages field actually ends up returning the same object across the restart? EDIT: Yeah that is exactly the problem, with JenkinsSessionRule we reuse the FileLogStorage object and so we cannot lose any data from before the restart.

Comment on lines +44 to +45
r.assertLogContains("Resuming build at ", b);
r.assertLogContains("Pausing (Preparing for shutdown)", b);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first assertion here passes with or without the fix, since that message is printed after the restart. However, the second assertion is printed right before Jenkins shuts down, and it is missing from the log if you revert the fix.

@dwnusbaum
Copy link
Member Author

Build failure is because program.dat couldn't be loaded on Windows after the restart: java.io.FileNotFoundException: C:\Jenkins\workspace\ugins_workflow-cps-plugin_PR-669\plugin\target\tmp\j h14028995992276197625\jobs\p\builds\1\program.dat (The system cannot find the file specified) Going to try a few rebuilds to see if it is consistent or not.

plugin/pom.xml Outdated Show resolved Hide resolved
@dwnusbaum
Copy link
Member Author

I am merging this now because I would like to get a release out sooner than later, but I'll try to keep an eye on Dependabot PRs and remove the test assumption once we pick up jenkinsci/jenkins-test-harness#559.

@dwnusbaum dwnusbaum merged commit edb8602 into jenkinsci:master Mar 7, 2023
@dwnusbaum dwnusbaum deleted the flush-log-on-shutdown branch March 7, 2023 22:40
@jglick
Copy link
Member

jglick commented Mar 7, 2023

once we pick up jenkinsci/jenkins-test-harness#559

You can simply override jenkins-test-harness.version with a TODO comment asking to delete the override once this gets picked up via plugin-pom.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants