Conversation

hakusaro

This is the commit message, soz for formatting in the PR:

When attempting a gh-ost migration, I observed that gh-ost sets the
lock timeout to 6 seconds, not 3 seconds, when attempting to lock
tables. This is potentially dangerous, because a user may set this flag
to 10 seconds and expect that the maximum table unavailability is a
total of 10 seconds, when it's really 20 seconds. I observed this
happening quite clearly, as gh-ost was unable to obtain the first-stage
lock several times. I didn't realize the docs didn't account for this
total lock time.

(Now, they do!)

I totally get why you would want issues to accompany PRs, but I feel like this is an easily solved problem. I can make an issue too, but I don't think it will realistically matter too much other than potentially pushing the default value down.
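
To illustrate the doubling described in the commit message: here's a minimal sketch (this is not gh-ost's actual code; `--cut-over-lock-timeout-seconds` is the flag the commit message refers to, but the function and variable names below are made up for illustration):

```go
// A minimal sketch (not gh-ost's actual code) of the behaviour described
// above: the configured cut-over lock timeout ends up being applied twice,
// so the worst-case table unavailability is double what the flag suggests.
package main

import (
	"fmt"
	"time"
)

// cutOverLockTimeout stands in for what --cut-over-lock-timeout-seconds
// configures (default 3s); the name is made up for illustration.
func worstCaseOutage(cutOverLockTimeout time.Duration) time.Duration {
	// Whether the session lock_wait_timeout is set to twice the configured
	// value, or the timeout is paid once per lock attempt when the cut-over
	// backs off and retries, the user pays the cost twice.
	return 2 * cutOverLockTimeout
}

func main() {
	for _, secs := range []int{3, 10} {
		configured := time.Duration(secs) * time.Second
		fmt.Printf("configured: %v, worst-case outage: %v\n",
			configured, worstCaseOutage(configured))
	}
}
```

Either way the doubling arises, the user-visible effect is the same: 3 seconds becomes 6, and 10 seconds becomes 20.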

@timvaillancourt
Collaborator

@hakusaro thanks for this PR! I agree it doesn't really need an issue 👍

Were you able to determine why the value is doubled by gh-ost, and where in the code this is happening? I'm curious whether this behaviour is unintentional and/or a bug. Thanks!

@hakusaro
Author

@timvaillancourt I don't quite remember how I arrived at this calculation, but it only happened because I was in the exact state described here: setting the timeout to, e.g., 10 seconds effectively doubled it to 20 seconds, because if the table fully locks up for the duration you specify, and gh-ost backs off and has to try again, you end up paying the cost twice. I think it happens somewhere like here, but the context is kinda lost now.

For some context, at @polleverywhere we switched to gh-ost as an alternative to LHM specifically because we were having lock contention at the exact cutover point. With LHM, this resulted in a 60-second table outage; retrying would eventually defeat the lock, but it was way too dangerous. gh-ost gave us a way to define the exact outage period. I definitely know this caused 6-second table outages with the smallest duration setting.

All I can say is that while gh-ost handles this gracefully, it's slightly less graceful than expected, because unwinding the cutover takes longer than the configured timeout suggests. I'd be happy to dig in further, but I'm not an expert in the codebase; I just know what happened to us. It may have changed between when I sent this PR and now, but I'm not certain.

@hakusaro
Author

It might be easier to find the cause by debugging and specifically inducing a lock contention problem (lock the table up, try to migrate, hit the cutover timeout, and see which code path is taken), then observing how gh-ost recovers. It works perfectly in our experience; it just takes twice as long as we expected to get the table back into a non-outage state.
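
If it helps, here's a rough sketch of one way to induce that contention from a side session (the DSN, table name, and hold duration below are made up; the idea is just to hold a WRITE lock on the target table while gh-ost attempts the cut-over, then watch how long recovery takes once the lock is released):

```go
// repro.go — a rough sketch of inducing cut-over lock contention.
// Hold a WRITE lock on the target table so a concurrent gh-ost cut-over
// hits its lock timeout and has to back off, then time how long it takes
// for the table to become available again after the lock is released.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// LOCK TABLES is session-scoped, so pin a single connection.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "LOCK TABLES my_table WRITE"); err != nil {
		log.Fatal(err)
	}
	log.Println("holding WRITE lock on my_table; trigger the gh-ost cut-over now")

	// Hold the lock long enough for gh-ost to hit its cut-over lock timeout
	// at least once.
	time.Sleep(30 * time.Second)

	if _, err := conn.ExecContext(ctx, "UNLOCK TABLES"); err != nil {
		log.Fatal(err)
	}
	log.Println("lock released; watch how long gh-ost takes to recover")
}
```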

@timvaillancourt
Collaborator

@hakusaro ok thanks!

Will investigate to see if a bug fix could make this behave as expected 👍
