
Optimize etcd readiness probe #151

Merged

Conversation

shreyas-s-rao
Collaborator

What this PR does / why we need it:
This PR optimizes the readiness probe for etcd by marking the health endpoint as ready before taking the first full snapshot rather than after it, as was done previously. This reduces etcd downtime during cluster updates.
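For illustration, here is a minimal sketch of the reordering (hypothetical handler and helper names, not the actual etcd-backup-restore code): the health endpoint starts reporting ready as soon as initialization and validation succeed, and the first full snapshot is taken afterwards.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// healthHandler backs the readiness probe of the etcd pod.
type healthHandler struct {
	ready atomic.Bool
}

func (h *healthHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if h.ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	h := &healthHandler{}

	go func() {
		validateEtcdAndStore() // hypothetical sanity checks: etcd is up, blob store reachable

		// Previously the first full snapshot was taken *before* this point,
		// keeping etcd unready for the whole upload. With this change the
		// endpoint is marked ready first, then the snapshot is taken.
		h.ready.Store(true)
		takeFullSnapshot()
	}()

	log.Fatal(http.ListenAndServe(":8080", h))
}

// Placeholders standing in for the real initialization and snapshot logic.
func validateEtcdAndStore() {}
func takeFullSnapshot()     {}
```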

Which issue(s) this PR fixes:

Special notes for your reviewer:

Release note:

Reduced etcd downtime by optimizing the readiness probe.

@shreyas-s-rao added the kind/enhancement, needs/review, component/etcd-backup-restore, platform/all, needs/lgtm, needs/ok-to-test and area/high-availability labels on Apr 26, 2019
@shreyas-s-rao added this to the 0.7.0 milestone on Apr 26, 2019
Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>
Contributor

@georgekuruvillak left a comment


lgtm

@swapnilgm
Contributor

As I was on leave, I wasn't part of this thread, so I might be missing some of the discussion behind this. Here is my opinion.

Are we trading correctness for performance here? The initial idea behind marking etcd unready (internally, reporting the backup-restore sidecar's health as false) until the first snapshot was that if we face an issue with snapshot uploads, the sidecar will fail to upload and go through multiple restarts without any successful snapshot. With this PR, since we allow etcd to become ready and serve traffic, it will keep accepting new user data in the meantime. So until somebody actually resolves the issue with the backup-restore sidecar (which may require manual intervention or a gardener reconciliation, since the most probable reason for failing to upload a snapshot is wrong credentials), the etcd revision might progress far ahead of the last successful snapshot. Making etcd ready right away therefore doesn't sound good to me.

That said, I agree that we can and should optimize this to some extent by being a little more intelligent about it, for example (see the sketch after this list):

  • If a restoration happened, there is no need to wait for the first full snapshot; mark backup-restore healthy and etcd ready right away.
  • If there was no restoration but the sidecar restarted because the last snapshot upload failed, the pod was rescheduled or a version update happened, we can check the difference between the current etcd revision and the revision of the last uploaded snapshot, and only take a full snapshot before marking the sidecar healthy and etcd ready if this difference is beyond some threshold. The threshold can be configured based on the observed revision change between delta snapshots.
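A minimal sketch of the revision-threshold check from the second bullet (the function and parameter names are hypothetical, and it assumes the current etcd revision and the last uploaded snapshot's revision are available):

```go
// needsInitialFullSnapshot decides whether an initial full snapshot must be
// taken before marking the backup-restore sidecar healthy and etcd ready.
// Illustrative only; not the project's actual API.
func needsInitialFullSnapshot(restored bool, currentRev, lastSnapshotRev, threshold int64) bool {
	if restored {
		// A restoration just brought etcd back to the latest backup, so no
		// initial full snapshot is needed before becoming ready.
		return false
	}
	// Restart due to a failed upload, pod reschedule or version update:
	// take a full snapshot first only if etcd has progressed far beyond the
	// last uploaded snapshot.
	return currentRev-lastSnapshotRev > threshold
}
```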

P.S.: Sorry for commenting on a merged PR. I don't want to lose the context here.

@amshuman-kr
Collaborator

@swapnilgm Here the readiness is set only after the sanity check is over and etcd is up. The sanity check should validate connectivity to the blob store, so the likelihood of the subsequent full backup upload failing is low.

Besides, if the full backup upload actually fails, then the next readinessProbe should fail, cutting off any further inbound traffic. I see the risk of data loss here as no greater than that of a full snapshot upload failing at any other point in time (other than during a restart). Do you see it otherwise?

Also, in #150 @shreyas-s-rao has proposed an approach to try a delta snapshot if possible instead of a full snapshot. That should, in theory, minimize the time to get feedback on whether the backup could be uploaded. WDYT?
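A rough sketch of that idea, assuming hypothetical snapshot helpers (this is not the implementation proposed in #150): on restart, attempt the much smaller delta snapshot first so that feedback on whether the blob store accepts uploads arrives quickly, and fall back to a full snapshot otherwise.

```go
// initialSnapshot tries a delta snapshot first (if one is possible) and falls
// back to a full snapshot. Both snapshot functions are passed in as
// placeholders for the real upload logic.
func initialSnapshot(canTakeDelta bool, takeDelta, takeFull func() error) error {
	if canTakeDelta {
		// A delta snapshot only contains changes since the last uploaded
		// snapshot, so it succeeds or fails quickly, giving early feedback
		// about the health of the backup path.
		if err := takeDelta(); err == nil {
			return nil
		}
		// Fall back to a full snapshot if the delta snapshot fails.
	}
	return takeFull()
}
```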

@swapnilgm
Contributor

> @swapnilgm Here the readiness is set only after the sanity check is over and etcd is up. The sanity check should validate connectivity to the blob store, so the likelihood of the subsequent full backup upload failing is low.

Agreed. But there could be network issues or context timeouts caused by large full snapshot uploads, which won't be caught by the sanity check.

> Besides, if the full backup upload actually fails, then the next readinessProbe should fail, cutting off any further inbound traffic.

This will be very rare, but backup-restore is supposed to handle rare cases. The next readiness probe will only fail after the readiness probe timeout. For a single failure that looks like a minimal window, but over consecutive failures of the initial full snapshot upload, the readiness probe will pass and fail repeatedly. Cumulatively, the windows in which the readiness probe passes add up, so etcd ends up being available (and accepting writes) for a long time.

> I see the risk of data loss here as no greater than that of a full snapshot upload failing at any other point in time (other than during a restart). Do you see it otherwise?

No, data loss due to a failure of the first full snapshot is different from data loss due to a later full snapshot failure. Consecutive failures of the initial full snapshot will result in the etcd revision progressing far ahead of the last successful snapshot, because etcd is ready for a long time (not necessarily continuously, due to the readiness probe, but cumulatively, as explained above). On the other hand, a failure of a full or delta snapshot at any other point in time will not result in data loss at all, because that case is taken care of (or is supposed to be taken care of) by the initial-full-snapshot approach.

> Also, in #150 @shreyas-s-rao has proposed an approach to try a delta snapshot if possible instead of a full snapshot. That should, in theory, minimize the time to get feedback on whether the backup could be uploaded. WDYT?

I agree with the proposal of starting the watch and delta snapshots. What I didn't get is: why aren't we implementing it? I don't think implementing it involves a large effort or timeline.

@amshuman-kr
Collaborator

> The next readiness probe will only fail after the readiness probe timeout. For a single failure that looks like a minimal window, but over consecutive failures of the initial full snapshot upload, the readiness probe will pass and fail repeatedly. Cumulatively, the windows in which the readiness probe passes add up, so etcd ends up being available (and accepting writes) for a long time.

Won't the same thing happen if the full snapshot fails at any other time too? Am I missing something here?

> I agree with the proposal of starting the watch and delta snapshots. What I didn't get is: why aren't we implementing it? I don't think implementing it involves a large effort or timeline.

We want to implement it as a priority, and @shreyas-s-rao wanted to do it already. It is just that we wanted to see if we can do the storage class migration in the dev landscape already this week; it has been pending for 3 weeks now. To me, the delta snapshot on restart looked like a large enough effort to postpone it to next week, after the migration. That is testing effort, not development effort: we do not have automated test cases to cover this scenario yet.

Do you see a risk in going ahead with the migration without this change?
