After the failed cluster is recovered, the residual resources are not deleted #3959

Closed
XiShanYongYe-Chang opened this issue Aug 21, 2023 · 17 comments · Fixed by #4080
Labels: kind/bug

@XiShanYongYe-Chang (Member)

What happened:

As described in comment: #3808 (comment)

What you expected to happen:

After the failed cluster is recovered, these residual resources can be deleted normally.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Karmada version:
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@XiShanYongYe-Chang (Member, Author)

/cc @chaunceyjiang @lxtywypc @zach593
Can you help take a look?

@zach593 (Contributor) commented Aug 22, 2023

How about letting the execution-controller watch the cluster object's unready -> ready transition? If we are worried about enqueueing too often, lastTransitionTime could help determine how long the cluster was unavailable.
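For illustration, a minimal sketch of such a transition check, assuming the cluster's readiness is exposed as a metav1.Condition of type "Ready"; the clusterBecameReady helper and the minDowntime threshold are illustrative, not actual Karmada code:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterBecameReady reports whether an update event represents an
// unready -> ready transition. The old Ready=False condition's
// lastTransitionTime tells how long the cluster was unavailable, so brief
// outages can be skipped to avoid enqueueing too often.
func clusterBecameReady(oldConds, newConds []metav1.Condition, minDowntime time.Duration) bool {
	if meta.IsStatusConditionTrue(oldConds, "Ready") {
		return false // was already ready; nothing to resync
	}
	if !meta.IsStatusConditionTrue(newConds, "Ready") {
		return false // still not ready
	}
	oldCond := meta.FindStatusCondition(oldConds, "Ready")
	if oldCond != nil && time.Since(oldCond.LastTransitionTime.Time) < minDowntime {
		return false // only a brief blip; skip to limit enqueue pressure
	}
	return true
}
```

In a controller-runtime setup, a check like this would typically sit in an UpdateFunc predicate on the Cluster watch.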

@XiShanYongYe-Chang (Member, Author)

After the cluster recovers, will all resources on the cluster be resynchronized? If so, will the pressure on the controller suddenly surge?

What does this strategy specifically refer to?
#3808 (comment)

@zach593 (Contributor) commented Aug 22, 2023

> After the cluster recovers, will all resources on the cluster be resynchronized? If so, will the pressure on the controller suddenly surge?

I've been thinking for a while that we could make objectWatcher use a DeepEqual() check to see whether the content has actually changed (mutating webhooks and API compatibility between Karmada and member clusters might be a problem), so that it can reduce the frequency of updates to member clusters when the execution-controller reconciles.

> What does this strategy specifically refer to?
> #3808 (comment)

This rate-limits every affected item, and clearing items from the rate limiter may take a lot of time. Also, one day we might let users control the rate limiter options and the number of async worker retries (or just switch to controller-runtime); in that case, a recovery mechanism that relies on dynamic parameters is unreliable.

On the other hand, there's not much difference between watching cluster objects and relying on the async worker / controller-runtime retry mechanism; both trigger the execution-controller to reconcile.
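For reference, a minimal sketch of the DeepEqual() idea mentioned above, assuming the objects are handled as unstructured; the needsUpdate helper and the spec-only comparison are assumptions, not the actual objectWatcher code:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// needsUpdate reports whether the desired object differs from what is already
// present in the member cluster. Comparing whole objects would almost always
// report a difference, because member-cluster webhooks and defaulting mutate
// them, so only the managed portion (here: spec) is compared.
func needsUpdate(desired, observed *unstructured.Unstructured) bool {
	return !equality.Semantic.DeepEqual(desired.Object["spec"], observed.Object["spec"])
}
```

Skipping the update when nothing relevant changed would reduce write pressure on member clusters during a resync burst, though the webhook/API-compatibility caveats above would still call for a more careful field-level comparison.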

@chaunceyjiang (Member)

/cc @XiShanYongYe-Chang @lxtywypc @RainbowMango

A new version is about to be released. If this issue is not resolved, I think we may need to revert #3808.

@RainbowMango (Member)

@XiShanYongYe-Chang @lxtywypc What's your opinion?

@RainbowMango added this to the v1.7 milestone on Aug 30, 2023
@XiShanYongYe-Chang (Member, Author)

My thoughts may not be correct, but my view is that the scope of this issue can be controlled. Can we consider moving it to the next version and addressing the problems mentioned in the current issue there? Please correct me if I'm wrong.

@chaunceyjiang (Member)

According to the description in #3999, the finalizer of the Work was not removed correctly, which causes the ExecutionSpace to not be deleted and thus prevents the cluster from being removed.

I am most concerned about the inability to remove the cluster.

@XiShanYongYe-Chang (Member, Author)

OK, thanks.
/cc @lxtywypc What do you think?

@lxtywypc (Contributor)

Hi @RainbowMango @chaunceyjiang @XiShanYongYe-Chang
I'm sorry for the late reply.

After discussion within our team, we think the core issue is why we need a max retry in AsyncWorker at all. We could have a further discussion on this later.

As for #3808, I think it's okay to revert it for the new version coming soon. But we think its core idea is still right, and we hope it can be brought back once we solve the max retry problem.

@RainbowMango (Member)

Great thanks @lxtywypc

I totally agree that the work in #3808 is valuable; yes, definitely, we can bring it back in the next release (1.8).
I also have some ideas about it and will leave my thoughts on #3807.

Since we all agree to revert it, who will help to do it?

@lxtywypc (Contributor)

> Since we all agree to revert it, who will help to do it?

I'll do it as soon as possible.

@lxtywypc (Contributor)

I think we could discuss this further now.

Firstly, we wonder why we need max retry in AsyncWorker. Kindly pinging @XiShanYongYe-Chang, could you help explain it again with more details or examples?

@XiShanYongYe-Chang (Member, Author)

Hi @lxtywypc, I may not be able to explain why AsyncWorker needs a maximum retry count; we need @RainbowMango's help to answer this.

IMO, maybe we don't need the max retry; directly using the configuration of RateLimitingInterface should be fine.
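For illustration, a minimal sketch of retrying purely through the rate limiter's backoff with no max-retry cutoff; the queue setup and the reconcile callback are assumptions, not the actual AsyncWorker implementation:

```go
package sketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newQueue builds a queue whose per-item exponential backoff bounds retry
// pressure, so no fixed retry count is needed.
func newQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
	)
}

// processNextItem retries failed keys indefinitely with backoff instead of
// dropping them after a maximum number of attempts.
func processNextItem(queue workqueue.RateLimitingInterface, reconcile func(key interface{}) error) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	if err := reconcile(key); err != nil {
		// No max-retry check: keep retrying with backoff until the item
		// eventually succeeds, e.g. after the member cluster recovers.
		queue.AddRateLimited(key)
		return true
	}
	queue.Forget(key)
	return true
}
```

With this shape, items for a failed cluster keep backing off rather than being dropped, so they are still in the queue when the cluster comes back.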

@XiShanYongYe-Chang (Member, Author)

Hi @lxtywypc, I have talked to @RainbowMango, and we agree that we do not need this max retry. Would you like to update it?

@lxtywypc (Contributor)

@XiShanYongYe-Chang @RainbowMango Thanks for your replies.

I'm glad to help update it, and I will try to bring #3808 back afterwards. :)

@lxtywypc (Contributor)

/assign
