Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-67237] BuildTrigger waits until the dependency graph has been updated #6101

Merged
merged 10 commits into from
Mar 29, 2022

Conversation

Si-So
Copy link
Contributor

@Si-So Si-So commented Dec 23, 2021

See JENKINS-67237.

Proposed changelog entries

  • When triggering a new build while the build graph is currently being re-computed, jenkins waits for the re-computation to finish. This guarantees that recently updated build triggers are executed.

Proposed upgrade guidelines

N/A

Submitter checklist

  • (If applicable) Jira issue is well described
  • Changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developer, depending on the change). Examples
    • Fill-in the Proposed changelog entries section only if there are breaking changes or other changes which may require extra steps from users during the upgrade
  • Appropriate autotests or explanation to why this change has no tests
  • For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@mention

Maintainer checklist

Before the changes are marked as ready-for-merge:

  • There are at least 2 approvals for the pull request and no outstanding requests for change
  • Conversations in the pull request are over OR it is explicit that a reviewer does not block the change
  • Changelog entries in the PR title and/or Proposed changelog entries are correct
  • Proper changelog labels are set so that the changelog can be generated automatically
  • If the change needs additional upgrade steps from users, upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the PR title. (example)
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@oleg-nenashev oleg-nenashev requested review from a team and oleg-nenashev December 24, 2021 00:43
@oleg-nenashev
Copy link
Member

It will definitely need to be added to the changelog. Seem to be rather a bug than an enhancement

@oleg-nenashev oleg-nenashev added the bug For changelog: Minor bug. Will be listed after features label Dec 24, 2021
@Si-So
Copy link
Contributor Author

Si-So commented Jan 4, 2022

I am assuming that it is my responsibility to fix the failing check. Unfortunately, I was unable to reproduce the error message both on my Windows and Linux machine and therefore I can not prepare a fix.
Here's my setup:
JDK 11
maven 3.8.4
maven-checkstyle-plugin:3.1.2

I suspect that the problem might be related to control characters. When doing a fresh git checkout on my Linux machine, I can not see anything wrong with it (e.g.: all line endings of a freshly checked out Jenkins.java are LF).

@timja
Copy link
Member

timja commented Jan 4, 2022

The build log tells you what to do. Search 'fix'

@Si-So
Copy link
Contributor Author

Si-So commented Jan 4, 2022

The problem is that the build log of my Linux machine and the build log from ci.jenkins.io differ:
On ci.jenkins.io the problem occurs in the file Jenkins.java in the reactor "Jenkins core".

On my machine, the problem occurs in the file JenkinsFutureDependencyGraphTest.java in the reactor "Tests for Jenkins core".

I have attached a minimal log file that I generated by executing mvn com.diffplug.spotless:spotless-maven-plugin:3.1.2:check. Just to be sure, I also attached a full log-file which I generated by executing mvn --batch-mode --show-version --errors --no-transfer-progress -Pdebug -Pjapicmp --update-snapshots -Dmaven.test.failure.ignore -Dspotbugs.failOnError=false -Dcheckstyle.failOnViolation=false -Dset.changelist help:evaluate -Dexpression=changelist --log-file ../maven-log.txt -DskipTests clean install

I am aware that I can fix the format violation by running mvn spotless:apply, but since my machine complains about a different format violation than ci.jenkins.io, I assume that the error will persist. I believe ci.jenkins.io did not find the format violations that my machine found, because it aborted the build earlier.

I could fix the format violation on my machine and commit the change, but I ideally want to fix all format violations in one commit.

Should I run mvn spotless:apply locally, create a commit and push it, even though I do not believe this will fix the format violation found by ci.jenkins.io?

maven-full-log.txt
maven-short-log.txt

@timja
Copy link
Member

timja commented Jan 4, 2022

try it with Java 8, spotless will use a newer version on Java 11

@Si-So
Copy link
Contributor Author

Si-So commented Jan 6, 2022

That took me some time to figure out.

Between creating my fork and today, new format rules were introduced in the following PRs:
#6149
#6147

Since my fork only had the old pom.xml, my machines did not know about the new format rules, whereas ci.jenkins.io knew.

I solved the problem by copying the pom.xml over from the current master branch. I then saw identical format violations on my machine and ci.jenkins.io and could fix them.

Changelog entry:

When triggering a new build while the build graph is currently being re-computed, jenkins waits for the re-computation to finish. This gurantees that recently updated build triggers are executed.

@timja timja changed the title [Jenkins 67237] BuildTrigger waits until the dependency graph has been updated [JENKINS-67237] BuildTrigger waits until the dependency graph has been updated Jan 6, 2022
@Si-So
Copy link
Contributor Author

Si-So commented Jan 7, 2022

To fix the test failure, I first tried to replicate it on my Windows machine. Unfortunately, the test does not fail on my machine :/

Strangely enough other tests fail on my machine that do not fail on ci.jenkion.io.

It seems that jenkins can not dispose the test environment and I don't know how my changes could have caused that.

Is it possible that disposing the test environment is flaky? Maybe a dumb question, but I want to make sure before investigating further.

@Si-So
Copy link
Contributor Author

Si-So commented Feb 9, 2022

I further looked into the possible causes for the flakiness of the tests. Sometimes the log files cannot be deleted after the test has been run which causes the test to fail. I will explain how this can occur by using hudson.model.QueueTest.persistenceBlockedItem as an example.

In this test, the jenkins instance is shut down while a build is running. Sometimes (this is the reason why the test is flaky) the job is in the cleanup phase (AbstractBuildExecution.cleanUp) in which the possible downstream projects are determined (BuildTrigger.execute).

Due to my changes, the BuildTrigger.execute method can now take much longer than before since it is waiting for the calculation of the dependency graph to finish (the calculation of the dependency graph was started by the creation of a job earlier in this test). In BuildTrigger.execute a write to the log file occurs. When the shutdown occurs during writing to the log file, the JenkinsRule cannot delete the log file (since it is still being written to) and the test fails.

The reason why I could not replicate this problem on my local machine is that it is sensitive to timing. My machine is either too slow or too fast compared to ci.jenkins.io.

I can think of three possible solutions:

  1. Modify the flaky tests. This can be done in multiple ways. One is to wait until the dependency graph calculation has finished (I fixed one of the flaky tests in this way in my last commit).
  2. Modify the JenkinsRule.after method in such a way that it waits until the log file is no longer being written to.
  3. Change the business logic of either BuildTrigger.execute or Jenkins.cleanUp: One solution would be to wait in Jenkins.cleanUp to cancel all dependency graph calculations that are still ongoing.

I believe option 3 might be the best solution, but I need feedback on how to proceed.

Two things made it hard for me to make progress on this issue:
a) I cannot reproduce the problem on my machine. I think this is intrinsic to the problem and cannot be fixed :'(
b) I cannot start Unit-Test from eclipse. I think this can be fixed, but I don't know how. Help is greatly appreciated.

I followed both https://wiki.jenkins.io/JENKINS/Building-Jenkins.html and https://wiki.jenkins.io/JENKINS/Setting-up-Eclipse-to-build-Jenkins.html but to no success. When executing a unit-test from eclipse I get a java.lang.ClassNotFoundException: hudson.model.QueueTest exception.

I helped myself by starting the test from mvn like so mvn -Dmaven.surefire.debug -Dtest=QueueTest#persistenceBlockedItem test and then connecting via remote debugging. Another annoying thing that happens when I go this route is that eclipse is confused by the presence of two files named QueueTest and always picks the wrong one when halting at a breakpoint...

@timja
Copy link
Member

timja commented Feb 9, 2022

Can't really help on your questions, but:

I helped myself by starting the test from mvn like so mvn -Dmaven.surefire.debug -Dtest=QueueTest#persistenceBlockedItem test and then connecting via remote debugging. Another annoying thing that happens when I go this route is that eclipse is confused by the presence of two files named QueueTest and always picks the wrong one when halting at a breakpoint...

IntelliJ also fails to run Jenkins core tests in general for me at least. I use the same approach you did with -Dmaven.surefire.debug and mvn test, although not hit the other issue you mentioned.

…ng shutdown to avoid exceptions during JenkinsRule.after
@Si-So
Copy link
Contributor Author

Si-So commented Feb 22, 2022

Okay, I added a new method to Jenkins called _cleanUpCancelDependencyGraphCalculation().

This method cancels an ongoing dependency graph calculation and thus the flaky tests are once more stable. Canceling dependency graph calculation during shutdown is unproblematic since they are re-calculated during startup anyway.

This PR is good to go from my POV.

@Si-So
Copy link
Contributor Author

Si-So commented Feb 24, 2022

@jenkinsci/core-pr-reviewers

@timja timja requested a review from a team February 24, 2022 08:45
@Si-So
Copy link
Contributor Author

Si-So commented Mar 18, 2022

Can I support the review process in any way?

}


public DependencyGraph createNewDependencyGraph() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just inline this call into its callers, we don't want to add this new public API since no one would find it useful. If you need it for testing could you make it package scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't make it package scope. The relevant unit-test is located in jenkins-test, because I need JenkinsRule which is only available in jenkins-test.

By modifying the unit-test, I could do without the method.

Thanks for your feedback.

@@ -3917,6 +3921,20 @@ private void _cleanUpReleaseAllLoggers(List<Throwable> errors) {
}
}

private void _cleanUpCancelDependencyGraphCalculation() {
LOGGER.log(Level.FINE, "Canceling internal dependency graph calculation");
if (scheduledFutureDependencyGraph != null && !scheduledFutureDependencyGraph.isDone()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to syncronize the entire block because scheduledFutureDependency graph can become null in between the nullcheck and the isDone call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it. Thank you for your feedback :)


// Mark that we finished calculating the dependency graph
synchronized (dependencyGraphLock) {
calculatingFutureDependencyGraph = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should not null this out here, we don't nullcheck this for any scheduling purpose.

Not nulling this will allow the above code to probably be simplified to

if (calculatingFutureDependencyGraph != null) {
    calculatingFutureDependencyGraph.get();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now making sure that calculatingFutureDependencyGraph never becomes null. This made this null check obsolete and some other checks became obsolete as well and I deleted them. Thank you for your feedback.

Si-So added 2 commits March 23, 2022 09:35
…atingFutureDependencyGraph never becomes null.

Fixed a synchronization bug in _cleanUpCancelDependencyGraphCalculation().
Slightly altered unit-test, because the behavior of Jenkins.getFutureDependencyGraph() changed when no graph has been built beforehand.
…and subsequently altered the corresponding unit-test
@Si-So Si-So requested a review from res0nance March 24, 2022 08:37
@Si-So
Copy link
Contributor Author

Si-So commented Mar 24, 2022

I think the Windows Build failed, because the Windows machine is overwhelmed. I came to this conclusion, because the Windows build on other PRs also failed.

@timja
Copy link
Member

timja commented Mar 27, 2022

@res0nance you able to take another look?

@timja timja requested a review from jglick March 27, 2022 08:09
@timja
Copy link
Member

timja commented Mar 27, 2022

@jglick you know anything about this area?

@jglick jglick removed their request for review March 27, 2022 14:56
@jglick
Copy link
Member

jglick commented Mar 27, 2022

you know anything about this area?

Not much, and I am not inclined to learn. At one point I had thought to port DependencyGraph to Pipeline (JENKINS-29913) but it never seemed pressing. The DependencyGraph system is rather complex—something that probably ought to have been done in a plugin, but would be difficult to split now due to old API design choices. Many (though certainly not all) use cases can be expressed more clearly and directly in Pipeline using control flow constructs within a single job with multiple stages; or by having an orchestration job use build steps to launch other jobs at specific times under defined conditions. For the cases where dependency relationships between independent and autonomous jobs are most appropriate, another option is to use triggers based on external events (such as deployments to artifact repositories); or domain-specific features more sensitive to the nuances of particular build systems, like https://github.com/jenkinsci/pipeline-maven-plugin#feature-trigger-downstream.

…onger lock during re-calculation of dependency graph.
@Si-So Si-So requested a review from res0nance March 28, 2022 07:42
Copy link
Contributor

@res0nance res0nance left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@timja timja added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Mar 28, 2022
@timja
Copy link
Member

timja commented Mar 28, 2022

This PR is now ready for merge, after ~24 hours, we will merge it if there's no negative feedback.

Thanks!

@basil basil merged commit 292382f into jenkinsci:master Mar 29, 2022
@basil
Copy link
Member

basil commented Apr 5, 2022

I am afraid this may have caused the regression noted in JENKINS-68189.

@res0nance
Copy link
Contributor

I'm guessing that this is because rebuildDependencyGraph is called here
https://github.com/Si-So/jenkins/blob/a83e40d3cceab702b8024710adcd29d4cac7f1f0/core/src/main/java/jenkins/model/Jenkins.java#L3434

Maybe if we call the async version with a get this will update the calculatingFutureDependencyGraph which at the start holds EMPTY

@jglick
Copy link
Member

jglick commented Apr 6, 2022

Whatever the case, standard procedure would be for this PR to be reverted promptly, and then a separate PR filed as time permits to reintroduce the change along with a fix for the regression and a new functional test covering the broken scenario.

@Si-So
Copy link
Contributor Author

Si-So commented Apr 6, 2022

I second that. I thought I had enough time today to look at the issue, but didn't.

When I have time I will both contribute a new test and a fix.

@basil
Copy link
Member

basil commented Apr 6, 2022

In the absence of a fix and a new test, I plan to revert this PR on Friday.

@res0nance
Copy link
Contributor

Whatever the case, standard procedure would be for this PR to be reverted promptly, and then a separate PR filed as time permits to reintroduce the change along with a fix for the regression and a new functional test covering the broken scenario.

Agreed we should just revert this

@Si-So
Copy link
Contributor Author

Si-So commented Apr 7, 2022

I was able to reproduce the bug on my machine and I am working now on both the test case and a fix. The merge should be reverted because the new bug is much more severe than the bug that I fixed.

Steps to reproduce the bug:

  1. Start a Jenkins instance.
  2. Create Job_A.
  3. Create Job_B. Job_B is triggered by Job_A.
  4. Restart Jenkins.
  5. A build of Job_B is no longer triggered after Job_A has been built.

This bug does occur even if only one node is used.

Analysis:

res0nance already found the cause of the bug. After restarting Jenkins, rebuildDependencyGraphAsync() is never triggered and thus the calculatingFutureDependencyGraph remains EMPTY until the dependency graph has been re-computed asynchronously.

MarkEWaite added a commit that referenced this pull request Apr 7, 2022
MarkEWaite added a commit to MarkEWaite/jenkins that referenced this pull request Apr 7, 2022
…ph has been updated (jenkinsci#6101)"

https://issues.jenkins.io/browse/JENKINS-68189 reports that downstream
jobs are no longer executed in Jenkins 2.341.

https://issues.jenkins.io/browse/JENKINS-67237 reports that Downstream
jobs are not guaranteed to build when the job configs have been updated
recently.

Comments in jenkinsci#6101 recommend
that this pull request be reverted while a fix is being developed.
This change is the revert of the pull request.

The revert was generated by pressing the "revert" button on the GitHub
interface, then using the resulting commit in this pull request.

This reverts commit 292382f.
MarkEWaite added a commit that referenced this pull request Apr 8, 2022
Revert "[JENKINS-67237] `BuildTrigger` waits until the dependency graph has been updated (#6101)"
Si-So added a commit to Si-So/jenkins that referenced this pull request Apr 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants