After the failed cluster is recovered, the residual resources are not deleted #3959

Closed
XiShanYongYe-Chang opened this issue Aug 21, 2023 · 17 comments · Fixed by #4080
Labels: kind/bug

@XiShanYongYe-Chang (Member)

What happened:

As described in comment: #3808 (comment)

What you expected to happen:

After the failed cluster is recovered, these residual resources can be deleted normally.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Karmada version:
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@XiShanYongYe-Chang (Member, Author)

/cc @chaunceyjiang @lxtywypc @zach593
Can you help take a look?

@zach593 (Contributor) commented Aug 22, 2023

How about letting the execution-controller watch the cluster object's unready -> ready transition? If we are worried about enqueueing too often, lastTransitionTime could help determine how long the cluster was unavailable.
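For illustration, a minimal sketch of such a transition check, assuming the cluster's readiness is exposed as a metav1.Condition of type "Ready"; the clusterBecameReady helper and the minDowntime threshold are illustrative, not actual Karmada code:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterBecameReady reports whether an update event represents an
// unready -> ready transition. The old Ready=False condition's
// lastTransitionTime tells how long the cluster was unavailable, so brief
// outages can be skipped to avoid enqueueing too often.
func clusterBecameReady(oldConds, newConds []metav1.Condition, minDowntime time.Duration) bool {
	if meta.IsStatusConditionTrue(oldConds, "Ready") {
		return false // was already ready; nothing to resync
	}
	if !meta.IsStatusConditionTrue(newConds, "Ready") {
		return false // still not ready
	}
	oldCond := meta.FindStatusCondition(oldConds, "Ready")
	if oldCond != nil && time.Since(oldCond.LastTransitionTime.Time) < minDowntime {
		return false // only a brief blip; skip to limit enqueue pressure
	}
	return true
}
```

In a controller-runtime setup, a check like this would typically sit in an UpdateFunc predicate on the Cluster watch.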

@XiShanYongYe-Chang (Member, Author)

After the cluster recovers, will all resources on the cluster be resynchronized? If so, will the pressure on the controller suddenly surge?

What does this strategy specifically refer to?
#3808 (comment)

@zach593 (Contributor) commented Aug 22, 2023

> After the cluster recovers, will all resources on the cluster be resynchronized? If so, will the pressure on the controller suddenly surge?

I've been thinking for a while that we could make objectWatcher use a DeepEqual() check to see whether the content has actually changed (mutating webhooks and API compatibility between Karmada and member clusters might be a problem), so that it can reduce the frequency of updates to member clusters when the execution-controller reconciles.

> What does this strategy specifically refer to?
> #3808 (comment)

This rate-limits every affected item, and clearing items from the rate limiter may take a lot of time. Also, one day we might let users control the rate limiter options and the number of async worker retries (or just switch to controller-runtime); in that case, a recovery mechanism that relies on dynamic parameters is unreliable.

On the other hand, there's not much difference between watching cluster objects and relying on the async worker / controller-runtime retry mechanism; both trigger the execution-controller to reconcile.
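For reference, a minimal sketch of the DeepEqual() idea mentioned above, assuming the objects are handled as unstructured; the needsUpdate helper and the spec-only comparison are assumptions, not the actual objectWatcher code:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// needsUpdate reports whether the desired object differs from what is already
// present in the member cluster. Comparing whole objects would almost always
// report a difference, because member-cluster webhooks and defaulting mutate
// them, so only the managed portion (here: spec) is compared.
func needsUpdate(desired, observed *unstructured.Unstructured) bool {
	return !equality.Semantic.DeepEqual(desired.Object["spec"], observed.Object["spec"])
}
```

Skipping the update when nothing relevant changed would reduce write pressure on member clusters during a resync burst, though the webhook/API-compatibility caveats above would still call for a more careful field-level comparison.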

@chaunceyjiang (Member)

/cc @XiShanYongYe-Chang @lxtywypc @RainbowMango

A new version is about to be released. If this issue is not resolved, I think we may need to revert #3808.

@RainbowMango (Member)

@XiShanYongYe-Chang @lxtywypc What's your opinion?

@RainbowMango added this to the v1.7 milestone on Aug 30, 2023
@XiShanYongYe-Chang (Member, Author)

My thoughts may not be correct, but my view is that the scope of this issue can be controlled. Can we consider moving it to the next version and addressing the problems mentioned in the current issue there? Please correct me if I'm wrong.

@chaunceyjiang (Member)

According to the description in #3999, the finalizer of the Work was not removed correctly, which causes the ExecutionSpace to not be deleted and thus prevents the cluster from being removed.

I am most concerned about the inability to remove the cluster.

@XiShanYongYe-Chang (Member, Author)

OK, thanks.
/cc @lxtywypc What do you think?

@lxtywypc (Contributor)

Hi @RainbowMango @chaunceyjiang @XiShanYongYe-Chang
I'm sorry for the late reply.

After discussion within our team, we think the core issue is why we need a max retry in AsyncWorker at all. We could have a further discussion on this later.

As for #3808, I think it's okay to revert it for the new version coming soon. But we think its core idea is still right, and we hope it can be brought back once we solve the max retry problem.

@RainbowMango (Member)

Great thanks @lxtywypc

I totally agree that the work in #3808 is valuable; yes, definitely, we can bring it back in the next release (1.8).
I also have some ideas about it and will leave my thoughts on #3807.

Since we all agree to revert it, who will help to do it?

@lxtywypc (Contributor)

> Since we all agree to revert it, who will help to do it?

I'll do it as soon as possible.

@lxtywypc (Contributor)

I think we could discuss this further now.

Firstly, we wonder why we need max retry in AsyncWorker. Kindly pinging @XiShanYongYe-Chang, could you help explain it again with more details or examples?

@XiShanYongYe-Chang (Member, Author)

Hi @lxtywypc, I may not be able to explain why AsyncWorker needs a maximum retry count; we need @RainbowMango's help to answer this.

IMO, maybe we don't need the max retry; directly using the configuration of RateLimitingInterface should be fine.
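For illustration, a minimal sketch of retrying purely through the rate limiter's backoff with no max-retry cutoff; the queue setup and the reconcile callback are assumptions, not the actual AsyncWorker implementation:

```go
package sketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newQueue builds a queue whose per-item exponential backoff bounds retry
// pressure, so no fixed retry count is needed.
func newQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
	)
}

// processNextItem retries failed keys indefinitely with backoff instead of
// dropping them after a maximum number of attempts.
func processNextItem(queue workqueue.RateLimitingInterface, reconcile func(key interface{}) error) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	if err := reconcile(key); err != nil {
		// No max-retry check: keep retrying with backoff until the item
		// eventually succeeds, e.g. after the member cluster recovers.
		queue.AddRateLimited(key)
		return true
	}
	queue.Forget(key)
	return true
}
```

With this shape, items for a failed cluster keep backing off rather than being dropped, so they are still in the queue when the cluster comes back.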

@XiShanYongYe-Chang (Member, Author)

Hi @lxtywypc, I have talked to @RainbowMango, and we agree that we do not need this max retry. Would you like to update it?

@lxtywypc (Contributor)

@XiShanYongYe-Chang @RainbowMango Thanks for your replies.

I'm glad to help update it, and I will try to bring #3808 back afterwards. :)

@lxtywypc (Contributor)

/assign
