Optimize etcd readiness probe #151
Conversation
Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>
39d120c to 1fe1e87
lgtm
As I was on leave, I wasn't part of this thread, so I might be missing the discussion behind this. Here is my opinion on it: are we trading correctness here for performance? The initial idea behind making etcd unready (or, internally, the backup-restore sidecar health …) … That said, I agree we have to, and could, optimise this at some level by playing it a little more intelligently. Like …
P.S.: Sorry for commenting on a merged PR. I don't want to lose context here.
@swapnilgm Here the readiness is set only after the sanity check is over and etcd is up. The sanity check should validate connectivity to the blobstore, so the likelihood of the subsequent full backup upload failing is low. Besides, if the full backup upload actually fails, then the next readinessProbe should fail, cutting off any further inbound traffic. I see the risk of data loss here as no more than a full snapshot upload failing at any other point in time (other than during a restart). Do you see otherwise? Also, in #150 @shreyas-s-rao has proposed an approach to try a delta snapshot, if possible, instead of a full snapshot. That should, in theory, minimize the time to get feedback about whether the backup could be uploaded. WDYT?
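To make the ordering discussed above concrete, here is a minimal, hedged sketch of the kind of readiness handler in question; the `/healthz` path, the port, and the `ready` flag are illustrative assumptions, not taken from the actual sidecar code.

```go
package snapshotter

import (
	"net/http"
	"sync/atomic"
)

// ready is flipped by the snapshotter lifecycle: set after the sanity check
// passes, cleared again if a snapshot upload fails.
var ready atomic.Bool

// healthzHandler returns 200 while the sidecar considers itself ready and 503
// otherwise, so clearing `ready` after a failed upload makes the next kubelet
// readinessProbe fail and cuts off inbound traffic to etcd.
func healthzHandler(w http.ResponseWriter, _ *http.Request) {
	if ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

func serveHealthz() error {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthzHandler)
	return http.ListenAndServe(":8080", mux)
}
```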
Agreed. But there could be network issues or context timeouts caused by large full snapshot uploads, which won't be caught by the sanity check.
The next readiness probe will fail only after the readiness probe timeout. For a single failure that looks like a minimal window, but over consecutive failures of the initial full snapshot upload, the readiness probe will pass and fail across multiple time windows. Cumulatively, the time during which the readiness probe passes will leave etcd available for a long time (rough arithmetic is sketched after this comment).
No, data loss because of a failure of the first full snapshot and a failure of a later full snapshot are different. Consecutive failures of the initial full snapshot will let the etcd revision progress far ahead of the last successful snapshot, because etcd is ready for a long time (not necessarily continuously, due to the readiness probe, but because of the cumulative time window explained above). On the other hand, a failed full or delta snapshot at any other point in time will not result in data loss at all, because that case is taken care of (or is supposed to be taken care of) by the initial-full-snapshot approach.
I agree with the proposal of starting the watch and delta snapshots. What I didn't get is: why aren't we implementing it? I don't think implementing it involves a large effort or timeline.
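A rough, purely illustrative calculation of that cumulative window; all probe parameters and durations below are assumptions for the sake of the arithmetic, not values from this PR.

```go
package snapshotter

import "fmt"

// cumulativeReadyWindow gives a back-of-the-envelope figure for how long etcd
// stays writable across repeated failures of the initial full snapshot.
func cumulativeReadyWindow() {
	const (
		uploadSeconds    = 120 // assumed duration of one failed full-snapshot upload
		periodSeconds    = 5   // assumed readinessProbe period
		failureThreshold = 3   // assumed failed probes before the pod turns unready
		failedAttempts   = 4   // consecutive failed initial full snapshots
	)
	// Per attempt, etcd is ready for the upload itself plus the probes it takes
	// to flip unready; each restart reopens the window, so the exposure adds up
	// even though no single window looks large.
	perAttempt := uploadSeconds + failureThreshold*periodSeconds
	fmt.Printf("roughly %d s of writable-but-unbacked-up time over %d attempts\n",
		perAttempt*failedAttempts, failedAttempts)
}
```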
Won't the same thing happen if the full snapshot fails at any other time too? Am I missing something here?
We want to implement it on priority; @shreyas-s-rao wanted to do it already. It is just that we wanted to see if we can do the storage class migration this week in the dev landscape, which has been pending for 3 weeks now. To me, delta snapshots on restart looked like a large enough effort to postpone to next week, after the migration. That is testing effort, not development effort: we do not have automated test cases to cover this scenario yet. Do you see a risk in going ahead with the migration without this change?
What this PR does / why we need it:
This PR optimizes the readiness probe on etcd by marking the health endpoint as ready before taking the first full snapshot, rather than after, as was done previously. This reduces etcd downtime during cluster updates.
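A minimal sketch of the reordering described above, assuming hypothetical `sanityCheck` and `takeFullSnapshot` helpers rather than the real etcd-backup-restore functions:

```go
package snapshotter

import "sync/atomic"

// initializeSnapshotter sketches the ordering change: readiness is granted
// right after the sanity check instead of after the first full snapshot.
func initializeSnapshotter(ready *atomic.Bool, sanityCheck, takeFullSnapshot func() error) error {
	// Validate that etcd is up and the blobstore is reachable.
	if err := sanityCheck(); err != nil {
		return err
	}

	// Old ordering: takeFullSnapshot() first, ready.Store(true) afterwards,
	// keeping etcd unready for the entire initial upload.
	// New ordering per this PR: mark ready first, then take the snapshot.
	ready.Store(true)

	if err := takeFullSnapshot(); err != nil {
		// Drop readiness so the next probe fails and traffic is cut off again.
		ready.Store(false)
		return err
	}
	return nil
}
```

Under this ordering, a failed initial upload drops readiness again, which is the behaviour the conversation above relies on to cut off traffic after a failed backup.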
Which issue(s) this PR fixes:
Special notes for your reviewer:
Release note: