-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-29386: SnapshotProcedure and EnableTableProcedure can cause a deadlock #7084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@@ -136,4 +138,48 @@ public void run() { | |||
SnapshotTestingUtils.confirmSnapshotValid(TEST_UTIL, snapshotProto, TABLE_NAME, CF); | |||
SnapshotTestingUtils.confirmSnapshotValid(TEST_UTIL, snapshotOnSameTableProto, TABLE_NAME, CF); | |||
} | |||
|
|||
@Test | |||
public void testItCanEnableTableWhileSnapshotProcedureIsRunning() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails without my changes, both procedures will run forever. This also functions as the reproduction of the phenomenon.
For reproducibility, I manually set the table state to ENABLING, and then kick off the EnableTableProcedure at the stage it gets stuck on (after subprocesses finish)
this.shouldForceLockRelease = shouldForceLockRelease; | ||
} | ||
|
||
public boolean shouldForceLockRelease() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Procedure class has a holdLock
method which returns true for SnapshotProcedure. This prevents us from releasing the lock even on suspension. I believe that at the stage where the deadlock could happen, it's safe to release the lock. We haven't done any snapshotting yet, so I believe there shouldn't be any weird data inconsistencies from region splits/merges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just modify the SnapshotProcedure.holdLock() behaviour? Afterall, this is contrary to what we say in that method comment:
protected boolean holdLock(MasterProcedureEnv env) { // In order to avoid enabling/disabling/modifying/deleting table during snapshot, // we don't release lock during suspend return true; }
And you are already modifying SnapshotProcedure execution logic to reflect the special condition when the lock should be released, so better keep this there and avoid changing ProcedureSuspendedException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should. If we did, we'd have potential inconsistencies. As I mention in my comment, this bypass allows us to release the lock prior to us actually snapshotting any data. Once we start snapshotting (anything in at or past SNAPSHOT_SNAPSHOT_ONLINE_REGIONS) it is unsafe to release the lock.
I think the comment still holds. We want to avoid enabling/disabling/modifying/deleting a table during a snapshot. The procedure suspension occurs prior to us starting to snapshot any data, at SNAPSHOT_WRITE_SNAPSHOT_INFO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, maybe a less confusing option would be to return a non-static value for holdLock
in SnapshotProcedure
This forces you to manage additional state at each stage though, which I why I opted for this implementation which I thought was safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment still holds. We want to avoid enabling/disabling/modifying/deleting a table during a snapshot.
We may want to avoid, but now we are allowing a case where we suspend and release the lock.
Alternatively, maybe a less confusing option would be to return a non-static value for
holdLock
inSnapshotProcedure
That's what I meant by "modify the SnapshotProcedure.holdLock() behaviour". You are already modifying SnapshotProcedure behaviour to cause a release lock suspension via a flag in the ProcedureSuspendedException. We could keep this "flag" internally in SnapshotProcedure and use it to define what SnapshotProcedure.holdLock() returns.
This forces you to manage additional state at each stage though, which I why I opted for this implementation which I thought was safer
True. Yet, it seems more intuitive to me. But that's a personal opinion, just wanted to make sure it was also considered. Up to you to decide on the "final" solution. If going with the original, just please amend the comment in holdLock to explain the scenario where we release the lock.
TableState tableState = | ||
env.getMasterServices().getTableStateManager().getTableState(snapshotTable); | ||
if (tableState.isEnabled()) { | ||
SnapshotDescriptionUtils.writeSnapshotInfo(snapshot, workingDir, workingDirFS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move inside the if statements to avoid unnecessary I/O in case we need to suspend
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Another option, which is simpler to reason about, is to simply fail the Failing a procedure is simpler and doesn't introduce more edge cases in the procedure executor state machine. However, obviously, it's a better user experience if your procedures get executed eventually. |
Agreed. The reason I opted for supsending the procedure is because this is already something we do if we are splitting or merging regions. So this procedure will already suspend and resume in the case the table state isn't optimal. I don't think I have a very strong opinion one way or another, though. Enabling/disabling tables happens often enough that it might be fine to simply fail and force the user to manually handle the failure. cc @Apache9 Don't know if you have any strong opinions here either way. |
Do you mean "happens |
Yes I did; I am happy to fail the procedure if that's the general consensus. That being said, I do think modern backups start to stress this system out. Any call to the BackupAdmin triggers an enable table procedure, so we're more likely to get into this state. If we are okay with failing, then the user will have to manually kick off another snapshot on failure. Another possible solution would be to reset the state of the snapshot procedure, so that it needs to run from the beginning if (tableState.isEnabled()) {
// action
} else if (tableState.isDisabled) {
// action
} else {
// set up suspension timeout/persistence
setNextState(SnapshotState.SNAPSHOT_PREPARE);
} |
going to proceed with the strategy of simply failing |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When implementing this SnapshotProcedure, we decided to use shared lock and holdLock = true to prevent other table procedure jumps in in the middle of our execution while not hurting the availability because exclusive lock will also prevent region assigning.
In HBASE-28683, we introduced a new way to only allow one procedure to run at the same time for the same table, so maybe a possible way is to make SnapshotProcedure also acquire exclusive lock and set holdLock = false, so it will not be executed at the same time with Enable and Disable procedure.
Thanks.
setNextState(SnapshotState.SNAPSHOT_SNAPSHOT_CLOSED_REGIONS); | ||
} else { | ||
setState(ProcedureState.FAILED); | ||
throw new ProcedureAbortedException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use setFailure and then return Flow.NO_MORE_STATE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thank you for the suggestion
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
That's good to know, thank you for the context. The issue here is that the EnableTableProcedure will release the lock after it's subprocedures finish, which allows the SnapshotProcedure to execute on a table that is enabling. The SnapshotProcedure then gets stuck continuously re-running the same state over again, and will refuse to release the lock, creating a deadlock With all this said, I think it makes the most sense, for simplicity's sake, to fail the procedure if the table is in an invalid state. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Using the mechanism in HBASE-28683, SnapshotProcedure can not be executed together with EnableTableProcedure/DisabledTableProcedure together, which could also solve the problem, and I think it is more stable. I'm not sure whether ModifyTableProcedure could also affect SnapshotProcedure if it jumps in just in the middle of the execution... |
Yes, I'm able to avoid the deadlock if the SnapshotProcedure both sets This is certainly a cleaner implementation, though it means we can interleave other table procedures after cycles of the SnapshotProcedure. You hint at this here
and I believe that it would be problematic because you may snapshot regions multiple times, regions may move or go offline, so you may miss out on snapshotting regions. My initial implementation prevented this by only allowing SNAPSHOT_WRITE_SNAPSHOT_INFO to release the lock (when it's still safe to do so). However, I don't think it's safe to yield the lock after any cycle. After some thought, I think we still have two solutions here:
It seems the consensus is to avoid complicating the procedure and fail if we encounter the table in an invalid state, so I'm leaning towards 2 thought admittedly I'm a bit torn |
FWIW, if a SnapshotProcedure starts, no other table procedures which need a exclusive lock can be executed... And what I mean is that, we can reuse the mechanism introduced in HBASE-28683 to simply fix the problem. Just change the code here hbase/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/TableQueue.java Line 62 in 64c582f
Make Snapshot also return true, and change SnapshotProcedure's acquireLock method to also require exclusive lock, change holdLock to return false, then we are safe. Thanks. |
Ah okay, I understand. This does prevent us from being able to take multiple snapshots of the table, as mentioned in this comment but that seems ok |
570efea
to
8adc8bf
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
TestSnapshotProcedureRIT.testTableInMergeWhileTakingSnapshot failed, I guess this is related to our changes here? |
Yes must be related I am taking a look |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
No description provided.