🐛 ClusterResourceSet controller: fix reconciliation triggered by resource changes #4491
@@ -304,21 +304,18 @@ metadata:

```go
	},
	Data: map[string]string{},
}

// Let's wait until the initial reconciliations are *all* done, so we can test
// if the ConfigMap creation triggers a reconciliation. Otherwise a previous change
// of another resource triggers the reconciliation and we cannot verify if we react
// correctly to the ConfigMap creation.
time.Sleep(5 * time.Second)

g.Expect(testEnv.Create(ctx, newConfigmap)).To(Succeed())
defer func() {
	g.Expect(testEnv.Delete(ctx, newConfigmap)).To(Succeed())
}()

cmKey := client.ObjectKey{
	Namespace: defaultNamespaceName,
	Name:      newCMName,
}
g.Eventually(func() bool {
	m := &corev1.ConfigMap{}
	err := testEnv.Get(ctx, cmKey, m)
	return err == nil
}, timeout).Should(BeTrue())

// When the ConfigMap resource is created, CRS should get reconciled immediately.
g.Eventually(func() error {
	binding := &addonsv1.ClusterResourceSetBinding{}
```

Review thread on the `g.Eventually` Get of the new ConfigMap:

I'm not sure what the purpose of this Eventually Get was. It should just return instantly, because the Create above is synchronous. Maybe it was an attempt to ensure that the ConfigMap exists in the local cache, to fix some test flakiness?

This sounds about right 😅 All the controller-runtime stuff normally uses cached clients, so if … Have you tried running this many times to see if you can get it to fail without this extra block?

Not yet, I'll try. Although the 10s timeout is easily enough, so it's highly likely that there won't be any difference.

I ran it locally in a loop a few hundred times successfully. But I think we don't need the …

Let's add it back, having a …
This has been added to ensure the test would consistently fail without the fix in the ClusterResourceSet controller. If possible I would like to avoid adding a sleep. Are there any ways to detect that the previous reconciliations are done? (I think the last one doesn't change anything anywhere, so I cannot really wait for a specific reconciliation.)
Short of wrapping the reconciler to count reconciles (which I would recommend avoiding), I don't think there's really anything that can be done. You also need to account for other controllers and tests in the test suite that may also be creating events, so getting the queue to be truly empty will be very, very difficult AFAIK.

If we wanted to be more certain, we could potentially create a new test with a separate manager watching a different test namespace (I'm not sure this is actually possible here, this is more general advice); that would then be more isolated. In these scenarios it's easier to verify the reaction to specific events.

Out of interest, without this, were you seeing the test flake at all?
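For reference, a minimal sketch of what "wrapping the reconciler to count reconciles" could look like with controller-runtime (the approach discouraged above, shown only to make the idea concrete); the package, type, and field names are invented for illustration:

```go
package counting

import (
	"context"
	"sync/atomic"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// countingReconciler wraps another reconciler and counts its invocations.
type countingReconciler struct {
	inner reconcile.Reconciler
	count int64
}

// Reconcile increments the counter and delegates to the wrapped reconciler.
func (c *countingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	atomic.AddInt64(&c.count, 1)
	return c.inner.Reconcile(ctx, req)
}

// Count returns how many reconciles have been observed so far; a test could
// poll this to wait until the work queue has been quiet for a while.
func (c *countingReconciler) Count() int64 {
	return atomic.LoadInt64(&c.count)
}
```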
Not ideal, but we could think about something like adding an optional event recorder to the `resourceToClusterResourceSet` signature that we only pass through in tests, and that would record the exact resource and operation triggering the reconciliation.
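A rough sketch of how such an optional hook could be threaded through the mapping function; `triggerRecorder` and `resourceToCRSMapper` are invented names, not the actual cluster-api signature, and the hook would stay nil outside of tests:

```go
package mapping

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// triggerRecorder is only set in tests; it receives every object that
// triggered a mapping, so the test can assert on the exact trigger.
type triggerRecorder func(obj client.Object)

// resourceToCRSMapper returns a map function; the real implementation would
// translate the changed resource into requests for the matching
// ClusterResourceSets. Only the recording part is shown here.
func resourceToCRSMapper(record triggerRecorder) func(obj client.Object) []reconcile.Request {
	return func(obj client.Object) []reconcile.Request {
		if record != nil {
			record(obj) // record the exact resource that triggered the mapping
		}
		// ... map obj to the affected ClusterResourceSets here ...
		return nil
	}
}
```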
What about relying on ObservedGeneration instead?
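For illustration, one way the test could wait on it (a sketch reusing the identifiers from the test excerpt above; it assumes ClusterResourceSet exposes `status.observedGeneration`, and `crsKey` is a placeholder for the ClusterResourceSet's object key):

```go
// Sketch only: wait until the ClusterResourceSet's status.observedGeneration
// has caught up with metadata.generation before creating the ConfigMap.
g.Eventually(func() bool {
	crs := &addonsv1.ClusterResourceSet{}
	if err := testEnv.Get(ctx, crsKey, crs); err != nil {
		return false
	}
	return crs.Status.ObservedGeneration == crs.Generation
}, timeout).Should(BeTrue())
```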
@JoelSpeed The problem is that without the timeout the test was not reliably able to detect the broken code. So when I drop the timeout, there is a certain probability that the test doesn't exercise the reconcile triggered by the ConfigMap creation but one triggered by a previous change (either the Cluster or the ClusterResourceSet creation). Dropping it would only make the test "greener" because we would be testing the wrong thing. So the test was flaky because it couldn't reliably detect the broken code; with the timeout it can. Usually tests are flaky because a bug in the code occurs only under certain conditions; in this case the test code was only testing correctly under certain conditions (i.e. when all previous reconciles were already finished). Whenever the test code was testing the right thing, the test always failed.

@enxebre Would it be possible with this solution to detect which exact reconcile of which event led to the condition we're checking in l.313? If that condition was already satisfied by a previous reconcile and the "right" reconcile is a no-op, the test is already testing the wrong thing.

@fabriziopandini Do you mean that instead of the sleep I should wait until the resource has a specific ObservedGeneration, and only create the ConfigMap after that? I'm not entirely sure which reconciliation is currently the problem. I suspect there might be one which is actually a no-op (which then shouldn't change the ObservedGeneration?). I'm not sure I can wait for that one when it usually does nothing :/.

I think the problem is that I not only have to detect when the existing resources are steady; I also have to detect whether there are upcoming reconciliations, which are usually no-ops but could do the reconcile if they are triggered after the ConfigMap creation.
@fabriziopandini I added …

After: cluster-api/exp/addons/controllers/clusterresourceset_controller.go, lines 106 to 117 in 2ff65de

The output was: …

So it looks like there are multiple reconciliations with the same observed generation, so I can't really wait for the last one by checking the generation.

P.S. A previous version of this comment showed a slightly different log, but that was because I tested it incorrectly and didn't realize that the cleanup led to generation 2.
@sbueringer
Catching up after lagging a little bit on this...

I think we should avoid adding a sleep here, because ensuring that watches with partial metadata work as expected is not really something a behavioural test for ClusterResourceSet should be concerned with.

If we want test coverage ensuring that watches with partial metadata work well, I think we should have a dedicated test, maybe under util/predicates, so we can separate this issue from the CRS-specific tests. This would allow us to use a dedicated, super simple test reconciler that e.g. increases an annotation value on every reconcile, so we can actually check when a reconcile happens; see the sketch below.

Does this make sense? This could eventually also be addressed in a separate PR...
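For illustration, a minimal sketch (not code from this PR) of such an annotation-bumping test reconciler; the annotation key and all names here are invented:

```go
package predicates

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileCountAnnotation is a made-up annotation key used only by this sketch.
const reconcileCountAnnotation = "test.cluster.x-k8s.io/reconcile-count"

// annotationBumpingReconciler increments a counter annotation on the watched
// ConfigMap every time it is reconciled, so a test can observe exactly when
// (and how often) a reconcile happened.
type annotationBumpingReconciler struct {
	Client client.Client
}

func (r *annotationBumpingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cm := &corev1.ConfigMap{}
	if err := r.Client.Get(ctx, req.NamespacedName, cm); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	count, _ := strconv.Atoi(cm.Annotations[reconcileCountAnnotation])
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	cm.Annotations[reconcileCountAnnotation] = strconv.Itoa(count + 1)

	return ctrl.Result{}, r.Client.Update(ctx, cm)
}
```

A test could then assert that the annotation value changes after a specific event. Note that the update performed by the reconciler itself triggers another reconcile, so a real test would also want a predicate that filters out those self-inflicted update events.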