Conversation

hakusaro

This is the commit message, soz for formatting in the PR:

When attempting a gh-ost migration, I observed that gh-ost sets the
lock timeout to 6 seconds, not 3 seconds, when attempting to lock
tables. This is potentially dangerous, because a user may set this flag
to 10 seconds and expect that the maximum table unavailability is a
total of 10 seconds, when it's really 20 seconds. I observed this
happening quite clearly, as gh-ost was unable to obtain the first-stage
lock several times. I didn't realize the docs didn't account for this
total lock time.

(Now, they do!)

I totally get why you would want issues to accompany PRs, but I feel like this is an easily solved problem. I can make an issue too, but I don't think it will realistically matter too much other than potentially pushing the default value down.
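
To illustrate the doubling described in the commit message: here's a minimal sketch (this is not gh-ost's actual code; `--cut-over-lock-timeout-seconds` is the flag the commit message refers to, but the function and variable names below are made up for illustration):

```go
// A minimal sketch (not gh-ost's actual code) of the behaviour described
// above: the configured cut-over lock timeout ends up being applied twice,
// so the worst-case table unavailability is double what the flag suggests.
package main

import (
	"fmt"
	"time"
)

// cutOverLockTimeout stands in for what --cut-over-lock-timeout-seconds
// configures (default 3s); the name is made up for illustration.
func worstCaseOutage(cutOverLockTimeout time.Duration) time.Duration {
	// Whether the session lock_wait_timeout is set to twice the configured
	// value, or the timeout is paid once per lock attempt when the cut-over
	// backs off and retries, the user pays the cost twice.
	return 2 * cutOverLockTimeout
}

func main() {
	for _, secs := range []int{3, 10} {
		configured := time.Duration(secs) * time.Second
		fmt.Printf("configured: %v, worst-case outage: %v\n",
			configured, worstCaseOutage(configured))
	}
}
```

Either way the doubling arises, the user-visible effect is the same: 3 seconds becomes 6, and 10 seconds becomes 20.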

@timvaillancourt
Collaborator

@hakusaro thanks for this PR! I agree it doesn't really need an issue 👍

Were you able to determine why the value is doubled by gh-ost, and where in the code this is happening? I'm curious whether this behaviour is unintentional and/or a bug. Thanks!

@hakusaro
Author

@timvaillancourt I don't quite remember how I arrived at this calculation, but it only happened because I was in the exact state described here: setting the timeout to, e.g., 10 seconds effectively doubled it to 20 seconds, because if the table fully locks up for the duration you specify, and gh-ost backs off and has to try again, you end up paying the cost twice. I think it happens somewhere like here, but the context is kinda lost now.

For some context, at @polleverywhere we switched to gh-ost as an alternative to LHM specifically because we were having lock contention at the exact cutover point. With LHM, this resulted in a 60-second table outage; retrying would eventually defeat the lock, but it was way too dangerous. gh-ost gave us a way to define the exact outage period. I definitely know this caused 6-second table outages with the smallest duration setting.

All I can say is that while gh-ost handles this gracefully, it's slightly less graceful than expected, because unwinding the cutover takes longer than the configured timeout suggests. I'd be happy to dig in further, but I'm not an expert in the codebase; I just know what happened to us. It may have changed between when I sent this PR and now, but I'm not certain.

@hakusaro
Author

It might be easier to find the cause by debugging and specifically inducing a lock contention problem (lock the table up, try to migrate, hit the cutover timeout, and see which code path is taken), then observing how gh-ost recovers. It works perfectly in our experience; it just takes twice as long as we expected to get the table back into a non-outage state.
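
If it helps, here's a rough sketch of one way to induce that contention from a side session (the DSN, table name, and hold duration below are made up; the idea is just to hold a WRITE lock on the target table while gh-ost attempts the cut-over, then watch how long recovery takes once the lock is released):

```go
// repro.go — a rough sketch of inducing cut-over lock contention.
// Hold a WRITE lock on the target table so a concurrent gh-ost cut-over
// hits its lock timeout and has to back off, then time how long it takes
// for the table to become available again after the lock is released.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// LOCK TABLES is session-scoped, so pin a single connection.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "LOCK TABLES my_table WRITE"); err != nil {
		log.Fatal(err)
	}
	log.Println("holding WRITE lock on my_table; trigger the gh-ost cut-over now")

	// Hold the lock long enough for gh-ost to hit its cut-over lock timeout
	// at least once.
	time.Sleep(30 * time.Second)

	if _, err := conn.ExecContext(ctx, "UNLOCK TABLES"); err != nil {
		log.Fatal(err)
	}
	log.Println("lock released; watch how long gh-ost takes to recover")
}
```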

@timvaillancourt
Collaborator

@hakusaro ok thanks!

Will investigate to see if a bug fix could make this behave as expected 👍
