Description
Bug Report
Keycloak tests in the same class run in the same namespace and typically reuse the same parent CR name. The cleanup between tests ensures that the dependent StatefulSet is gone, but doesn't check for other dependents such as Secrets or Services.
In a rare case (1% of runs?) a timing issue leaves the temporary cache out of sync with the informer cache in a way that does not resolve on its own. In particular, the temporary cache shows a resource as existing, while the informer cache and the server have it as non-existent.
What did you do?
It's difficult to reproduce, and turning up logging made it even less likely to occur: with either fabric8 informer or operator SDK debug logging turned on for the reflector / informer, it failed to reproduce.
Instead I resorted to some logic that compared the result of getSecondaryResource against a listing of what was in the cache whenever we weren't fully ready after the main resolving method. That confirmed there was a temporary cache entry, but nothing in the informer cache nor on the server.
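For illustration, the comparison was roughly of the following shape. This is a minimal sketch, not the actual test code: the class and method names are hypothetical, and the three inputs stand in for the getSecondaryResource results, the informer cache listing, and a direct listing from the server.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the operator SDK or the Keycloak tests.
class CacheConsistencyCheck {

    // Returns the names that the secondary-resource lookup still reports
    // although neither the informer cache nor the server knows about them.
    static Set<String> staleNames(Map<String, String> secondaryLookups,
                                  Set<String> informerNames,
                                  Set<String> serverNames) {
        Set<String> stale = new TreeSet<>(secondaryLookups.keySet());
        stale.removeAll(informerNames);
        stale.removeAll(serverNames);
        return stale;
    }
}
```

A non-empty result corresponds to the state described above: an entry that can only be coming from the temporary cache.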
What did you expect to see?
The temporary cache state in sync.
What did you see instead? Under which circumstances?
In an extremely rare case, and seemingly related to the garbage collection of a resource with the same name, the state was out of sync.
Somehow the temporary cache is referencing the garbage collected entry - or one that it thought it created shortly after garbage collection, but that the api server mistakenly deletes.
A possible explanation: a create of the resource seems successful, so a resource is returned for that operation. The client then receives an added and a deleted event in rapid succession, before the operator has set the state of the temporary cache. Once the operator does set the temporary cache entry, that entry is no longer valid, and no further event will arrive to correct it.
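The suspected ordering can be replayed with a small model. This is only a sketch under assumptions: the two maps stand in for the informer cache and the temporary cache, the class and method names are hypothetical rather than the operator SDK's real ones, and it assumes that informer events evict the matching temporary entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the two caches; not the operator SDK's real classes.
class StaleTemporaryCacheSketch {
    final Map<String, String> informerCache = new HashMap<>();  // name -> resourceVersion
    final Map<String, String> temporaryCache = new HashMap<>(); // name -> resourceVersion

    // Informer event handlers. Assumption: an event for a name evicts
    // any temporary entry held for that name.
    void onAdd(String name, String version) {
        informerCache.put(name, version);
        temporaryCache.remove(name);
    }

    void onDelete(String name) {
        informerCache.remove(name);
        temporaryCache.remove(name);
    }

    // Called by the reconciler once the create call has returned.
    void cacheCreatedResource(String name, String version) {
        temporaryCache.put(name, version);
    }

    // Lookup prefers the temporary cache and falls back to the informer cache.
    Optional<String> getSecondaryResource(String name) {
        String temp = temporaryCache.get(name);
        return Optional.ofNullable(temp != null ? temp : informerCache.get(name));
    }

    static StaleTemporaryCacheSketch replaySuspectedSequence() {
        StaleTemporaryCacheSketch caches = new StaleTemporaryCacheSketch();
        // 1. The create succeeds on the server (name and version are illustrative).
        // 2. Both events are processed before the reconciler records the result.
        caches.onAdd("example-secret", "101");
        caches.onDelete("example-secret");
        // 3. Only now does the reconciler record the create result.
        caches.cacheCreatedResource("example-secret", "101");
        return caches;
    }

    public static void main(String[] args) {
        StaleTemporaryCacheSketch caches = replaySuspectedSequence();
        System.out.println("informer has it: "
            + caches.informerCache.containsKey("example-secret"));
        System.out.println("lookup returns: "
            + caches.getSecondaryResource("example-secret"));
    }
}
```

In this model the informer cache ends up empty while the lookup still returns the created version, and nothing later removes the temporary entry because both events have already been consumed.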
Environment
v4.8.0, Minikube v1.32.0, Kubernetes v1.27.10
Possible Solution
I'm not sure yet. I'd like to capture the full sequence of events to better understand. It could possibly involve tombstoning the temporary cache so that it won't set an entry if it's "too soon" after a delete event.
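To make the tombstoning idea concrete, here is a minimal sketch. It is not a proposed patch: the class, its methods, and the time window are all hypothetical, and the clock is injected only so the behaviour can be exercised deterministically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical temporary cache that remembers recent delete events.
class TombstoningTemporaryCache {
    private final Map<String, String> entries = new HashMap<>();  // name -> resourceVersion
    private final Map<String, Long> tombstones = new HashMap<>(); // name -> delete time (ms)
    private final long windowMillis;
    private final LongSupplier clock;

    TombstoningTemporaryCache(long windowMillis, LongSupplier clock) {
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    // A delete event evicts the entry and leaves a tombstone behind.
    void onDeleteEvent(String name) {
        entries.remove(name);
        tombstones.put(name, clock.getAsLong());
    }

    // Refuses to cache a resource whose name was deleted within the window.
    // Returns true if the entry was stored.
    boolean putIfNotTombstoned(String name, String version) {
        Long deletedAt = tombstones.get(name);
        if (deletedAt != null) {
            if (clock.getAsLong() - deletedAt < windowMillis) {
                return false; // too soon after a delete event
            }
            tombstones.remove(name); // tombstone expired
        }
        entries.put(name, version);
        return true;
    }

    boolean contains(String name) {
        return entries.containsKey(name);
    }
}
```

The trade-off in this sketch is that a resource legitimately recreated right after a delete would also be kept out of the temporary cache until the window passes, so the operator would fall back to whatever the informer cache reports in the meantime.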