🐛 ClusterResourceSet controller: fix reconciliation triggered by resource changes #4491


17 changes: 7 additions & 10 deletions exp/addons/controllers/clusterresourceset_controller_test.go
@@ -304,21 +304,18 @@ metadata:
},
Data: map[string]string{},
}

// Let's wait until the initial reconciliations are *all* done, so we can test
// if the ConfigMap creation triggers a reconciliation. Otherwise a previous change
// of another resource triggers the reconciliation and we cannot verify if we react
// correctly to the ConfigMap creation
time.Sleep(5 * time.Second)
@sbueringer (Member Author), Apr 16, 2021:

This has been added to ensure the test would consistently fail without the fix in the ClusterResourceSet controller. If possible I would like to avoid adding a sleep. Are there any ways to detect that the previous reconciliations are done? (I think the last one doesn't change anything anywhere, so I cannot really wait for a specific reconciliation.)

@JoelSpeed (Contributor):

Short of wrapping the reconciler to count reconciles (which I would recommend avoiding), I don't think there's really anything that can be done. You also need to account for other controllers and tests in the test suite that may also be creating events, so getting the queue to be truly empty will be very, very difficult AFAIK.

If we wanted to be more certain, we could potentially create a new test with a separate manager watching a different test namespace (I'm not sure this is actually possible here, this is more general advice), which would then be more isolated. In these scenarios it's easier to verify the reaction to specific events.

Out of interest, without this, were you seeing the test flake at all?
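
For illustration only, "wrapping the reconciler to count reconciles" could look roughly like the minimal Go sketch below; all names are hypothetical, and this is exactly the test-only plumbing the comment above recommends avoiding:

```go
package testutil

import (
	"context"
	"sync/atomic"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// countingReconciler wraps another reconciler and counts Reconcile calls,
// so a test could poll Reconciles() and wait for the count to stop increasing.
type countingReconciler struct {
	inner reconcile.Reconciler
	count int64
}

func (c *countingReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	atomic.AddInt64(&c.count, 1)
	return c.inner.Reconcile(ctx, req)
}

// Reconciles returns how many reconciliations have been observed so far.
func (c *countingReconciler) Reconciles() int64 {
	return atomic.LoadInt64(&c.count)
}
```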

@enxebre (Member):

Not ideal, but we could think about something like adding an optional event recorder to the resourceToClusterResourceSet signature, passed through only in tests, that would record the exact resource and operation triggering the reconciliation.
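
As a rough sketch of that idea; `reconciler`, `triggerHook`, and `resourceToRequests` are hypothetical stand-ins, not the actual ClusterResourceSet signatures:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconciler stands in for the ClusterResourceSet reconciler; triggerHook is
// the hypothetical test-only recorder, nil outside of tests.
type reconciler struct {
	triggerHook func(obj client.Object)
}

// resourceToRequests mirrors the shape of a handler.MapFunc: it maps a watched
// resource to reconcile requests and, in tests, records what triggered it.
func (r *reconciler) resourceToRequests(obj client.Object) []ctrl.Request {
	if r.triggerHook != nil {
		r.triggerHook(obj)
	}
	// The real mapping logic would look up the ClusterResourceSets whose
	// selector matches the cluster owning this resource.
	return []ctrl.Request{{NamespacedName: client.ObjectKeyFromObject(obj)}}
}

func main() {
	r := &reconciler{triggerHook: func(obj client.Object) {
		fmt.Printf("reconcile triggered by %T %s/%s\n", obj, obj.GetNamespace(), obj.GetName())
	}}
	cm := &corev1.ConfigMap{}
	cm.Namespace, cm.Name = "default", "configmap-test"
	_ = r.resourceToRequests(cm)
}
```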

@fabriziopandini (Member):

What about relying on Observed generation instead?
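
A sketch of what that could look like in the test, assuming the ClusterResourceSet status exposes ObservedGeneration and reusing the surrounding test's g, testEnv, ctx, and timeout (crsKey is a hypothetical ObjectKey for the ClusterResourceSet under test):

```go
// Wait until the controller has observed the latest spec before creating
// the ConfigMap, instead of sleeping for a fixed amount of time.
g.Eventually(func() bool {
	crs := &addonsv1.ClusterResourceSet{}
	if err := testEnv.Get(ctx, crsKey, crs); err != nil {
		return false
	}
	return crs.Status.ObservedGeneration == crs.Generation
}, timeout).Should(BeTrue())
```

As the follow-up below shows, this alone may not be enough, since multiple reconciles can share the same observed generation.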

@sbueringer (Member Author), Apr 19, 2021:

@JoelSpeed
The problem is that without the timeout the test was not reliably able to detect the broken code. So when I drop the timeout, there is a certain probability that the test doesn't exercise the reconcile triggered by the ConfigMap creation but one of the previous ones, from either the Cluster or the ClusterResourceSet creation. That only makes the test "greener" because we're testing the wrong thing. So the test was flaky because it couldn't reliably detect the broken code; with the timeout it can. Usually tests are flaky because the bug in the code occurs only under certain conditions; in this case the test code was only testing correctly under certain conditions (i.e. when all previous reconciles were already finished). Whenever the test code was testing the right thing, the test always failed.

@enxebre Would it be possible with this solution to detect which exact reconcile of which event led to the condition we're checking in l.313? If it was already done by a previous reconcile and the "right" reconcile is a no-op, the test is already testing the wrong thing.

@fabriziopandini Do you mean that instead of the sleep I should wait until the resource has a specific ObservedGeneration, and only create the ConfigMap after that? I'm not entirely sure which reconciliation is currently the problem. I suspect there might be one which is actually a no-op (which then shouldn't change the ObservedGeneration?). I'm not sure I can wait for that one when it usually does nothing :/.

I think the problem is that I don't only have to detect when the existing resources are steady. I also have to detect whether there are upcoming reconciliations, which are usually no-ops but could perform the actual reconciliation if they are triggered after the ConfigMap creation.

@sbueringer (Member Author), Apr 19, 2021:

@fabriziopandini I added

```go
fmt.Printf("reconcile: %v %v\n", req.NamespacedName, clusterResourceSet.ObjectMeta.Generation)
```

After:

```go
clusterResourceSet := &addonsv1.ClusterResourceSet{}
if err := r.Client.Get(ctx, req.NamespacedName, clusterResourceSet); err != nil {
	if apierrors.IsNotFound(err) {
		// Object not found, return. Created objects are automatically garbage collected.
		// For additional cleanup logic use finalizers.
		return ctrl.Result{}, nil
	}
	// Error reading the object - requeue the request.
	return ctrl.Result{}, err
}
// Initialize the patch helper.
```

The output was:

```
reconcile: default/clusterresourceset-c5u43l 1
reconcile: default/clusterresourceset-c5u43l 1
reconcile: default/clusterresourceset-c5u43l 1
...
< create configmap >
...
after each
reconcile: default/clusterresourceset-c5u43l 2
```

So it looks like there are multiple reconciliations with the same observed generation, so I can't really wait for the last one by checking the generation.

P.S. A previous version of this comment showed a slightly different log, but that was because I tested it incorrectly and didn't realize that the cleanup led to generation 2.

@fabriziopandini (Member), May 18, 2021:

@sbueringer
Catching up after lagging a little bit on this...

I think we should avoid adding a sleep here, because ensuring that watches with partial metadata work as expected is not really something a behavioural test for ClusterResourceSet should be concerned with.

If we want test coverage ensuring that watches with partial metadata work well, I think we should have a dedicated test, perhaps under util/predicates, so we can separate this issue from the CRS-specific tests.
This would allow us to use a dedicated, super simple test reconciler that e.g. increases an annotation value on every reconcile, so we can actually check when a reconcile happens (see the sketch below).

Does this make sense? This could eventually also be addressed in a separate PR...
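
A minimal sketch of such a dedicated test reconciler, here assumed to watch ConfigMaps and using a hypothetical annotation key:

```go
package testutil

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// annotationCountingReconciler bumps an annotation on every reconcile, so a
// test can observe exactly when, and how often, a watch event caused a reconcile.
type annotationCountingReconciler struct {
	client.Client
}

func (r *annotationCountingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cm := &corev1.ConfigMap{}
	if err := r.Get(ctx, req.NamespacedName, cm); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	n, _ := strconv.Atoi(cm.Annotations["test.cluster.x-k8s.io/reconcile-count"])
	cm.Annotations["test.cluster.x-k8s.io/reconcile-count"] = strconv.Itoa(n + 1)
	// Note: this Update generates another watch event, so a real test would
	// filter it out with a predicate or account for the extra reconcile.
	return ctrl.Result{}, r.Update(ctx, cm)
}
```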


g.Expect(testEnv.Create(ctx, newConfigmap)).To(Succeed())
defer func() {
	g.Expect(testEnv.Delete(ctx, newConfigmap)).To(Succeed())
}()

cmKey := client.ObjectKey{
@sbueringer (Member Author), Apr 16, 2021:

I'm not sure what the purpose of this Eventually Get was. It should just return instantly because the Create above is synchronous. Maybe it was an attempt to ensure that the ConfigMap exists in the local cache, to fix some test flakiness?

Contributor:

> Maybe it was an attempt to ensure that the ConfigMap exists in the local cache, to fix some test flakiness?

This sounds about right 😅 All the controller-runtime machinery normally uses cached clients, so if testEnv is also caching, this would be waiting until the informer has received the create event and updated the cache. Looking at the next Eventually, this basically lets us know that its timeout is used solely for that reconcile.

Have you tried running this many times to see if you can get it to fail without this extra block?

@sbueringer (Member Author):

Not yet, I'll try. Although the 10s timeout is easily enough, so it's highly likely that there won't be any difference.

@sbueringer (Member Author), Apr 19, 2021:

I ran it locally in a loop a few hundred times successfully. But I think we don't need the Get; the Eventually below tries for 20 seconds and locally it takes < 1s. I don't really mind though: if the consensus is that we should keep it, I'll just add it again.

Member:

Let's add it back; having a Get to make sure the cache syncs is the most idiomatic way to deal with caches and controllers lagging behind a bit.

	Namespace: defaultNamespaceName,
	Name:      newCMName,
}
g.Eventually(func() bool {
	m := &corev1.ConfigMap{}
	err := testEnv.Get(ctx, cmKey, m)
	return err == nil
}, timeout).Should(BeTrue())

// When the ConfigMap resource is created, CRS should get reconciled immediately.
g.Eventually(func() error {
binding := &addonsv1.ClusterResourceSetBinding{}