Skip to content

HBASE-29386: SnapshotProcedure and EnableTableProcedure can cause a deadlock #7084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hgromer
Copy link
Contributor

@hgromer hgromer commented Jun 10, 2025

No description provided.

@hgromer
Copy link
Contributor Author

hgromer commented Jun 10, 2025

@@ -136,4 +138,48 @@ public void run() {
SnapshotTestingUtils.confirmSnapshotValid(TEST_UTIL, snapshotProto, TABLE_NAME, CF);
SnapshotTestingUtils.confirmSnapshotValid(TEST_UTIL, snapshotOnSameTableProto, TABLE_NAME, CF);
}

@Test
public void testItCanEnableTableWhileSnapshotProcedureIsRunning() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails without my changes, both procedures will run forever. This also functions as the reproduction of the phenomenon.

For reproducibility, I manually set the table state to ENABLING, and then kick off the EnableTableProcedure at the stage it gets stuck on (after subprocesses finish)

this.shouldForceLockRelease = shouldForceLockRelease;
}

public boolean shouldForceLockRelease() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Procedure class has a holdLock method which returns true for SnapshotProcedure. This prevents us from releasing the lock even on suspension. I believe that at the stage where the deadlock could happen, it's safe to release the lock. We haven't done any snapshotting yet, so I believe there shouldn't be any weird data inconsistencies from region splits/merges.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just modify the SnapshotProcedure.holdLock() behaviour? Afterall, this is contrary to what we say in that method comment:

protected boolean holdLock(MasterProcedureEnv env) { // In order to avoid enabling/disabling/modifying/deleting table during snapshot, // we don't release lock during suspend return true; }
And you are already modifying SnapshotProcedure execution logic to reflect the special condition when the lock should be released, so better keep this there and avoid changing ProcedureSuspendedException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should. If we did, we'd have potential inconsistencies. As I mention in my comment, this bypass allows us to release the lock prior to us actually snapshotting any data. Once we start snapshotting (anything in at or past SNAPSHOT_SNAPSHOT_ONLINE_REGIONS) it is unsafe to release the lock.

I think the comment still holds. We want to avoid enabling/disabling/modifying/deleting a table during a snapshot. The procedure suspension occurs prior to us starting to snapshot any data, at SNAPSHOT_WRITE_SNAPSHOT_INFO.

Copy link
Contributor Author

@hgromer hgromer Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, maybe a less confusing option would be to return a non-static value for holdLock in SnapshotProcedure

This forces you to manage additional state at each stage though, which I why I opted for this implementation which I thought was safer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment still holds. We want to avoid enabling/disabling/modifying/deleting a table during a snapshot.

We may want to avoid, but now we are allowing a case where we suspend and release the lock.

Alternatively, maybe a less confusing option would be to return a non-static value for holdLock in SnapshotProcedure

That's what I meant by "modify the SnapshotProcedure.holdLock() behaviour". You are already modifying SnapshotProcedure behaviour to cause a release lock suspension via a flag in the ProcedureSuspendedException. We could keep this "flag" internally in SnapshotProcedure and use it to define what SnapshotProcedure.holdLock() returns.

This forces you to manage additional state at each stage though, which I why I opted for this implementation which I thought was safer

True. Yet, it seems more intuitive to me. But that's a personal opinion, just wanted to make sure it was also considered. Up to you to decide on the "final" solution. If going with the original, just please amend the comment in holdLock to explain the scenario where we release the lock.

TableState tableState =
env.getMasterServices().getTableStateManager().getTableState(snapshotTable);
if (tableState.isEnabled()) {
SnapshotDescriptionUtils.writeSnapshotInfo(snapshot, workingDir, workingDirFS);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move inside the if statements to avoid unnecessary I/O in case we need to suspend

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@charlesconnell
Copy link
Contributor

Another option, which is simpler to reason about, is to simply fail the SnapshotProcedure if the table isn't in a state it can handle. This was my approach in HBASE-29315, for the exact same reasons you're encountering here. I couldn't usefully "sleep" a SplitTableRegionProcedure because it used holdLock=true. I think it would be good to agree on a standard of what to do in these situations.

Failing a procedure is simpler and doesn't introduce more edge cases in the procedure executor state machine. However, obviously, it's a better user experience if your procedures get executed eventually.

@hgromer
Copy link
Contributor Author

hgromer commented Jun 10, 2025

Another option, which is simpler to reason about, is to simply fail the SnapshotProcedure if the table isn't in a state it can handle. This was my approach in HBASE-29315, for the exact same reasons you're encountering here. I couldn't usefully "sleep" a SplitTableRegionProcedure because it used holdLock=true. I think it would be good to agree on a standard of what to do in these situations.

Failing a procedure is simpler and doesn't introduce more edge cases in the procedure executor state machine. However, obviously, it's a better user experience if your procedures get executed eventually.

Agreed. The reason I opted for supsending the procedure is because this is already something we do if we are splitting or merging regions. So this procedure will already suspend and resume in the case the table state isn't optimal.

I don't think I have a very strong opinion one way or another, though. Enabling/disabling tables happens often enough that it might be fine to simply fail and force the user to manually handle the failure.

cc @Apache9 Don't know if you have any strong opinions here either way.

@rmdmattingly
Copy link
Contributor

Enabling/disabling tables happens often enough that it might be fine to simply fail and force the user to manually handle the failure.

Do you mean "happens often infrequently enough"? If so, I agree that snapshotting a disabled table should be a pretty exceptional request, and so it is fine to err on the side of simplicity and just fail the procedure

@hgromer
Copy link
Contributor Author

hgromer commented Jun 10, 2025

Enabling/disabling tables happens often enough that it might be fine to simply fail and force the user to manually handle the failure.

Do you mean "happens often infrequently enough"? If so, I agree that snapshotting a disabled table should be a pretty exceptional request, and so it is fine to err on the side of simplicity and just fail the procedure

Yes I did; I am happy to fail the procedure if that's the general consensus. That being said, I do think modern backups start to stress this system out. Any call to the BackupAdmin triggers an enable table procedure, so we're more likely to get into this state.

If we are okay with failing, then the user will have to manually kick off another snapshot on failure.

Another possible solution would be to reset the state of the snapshot procedure, so that it needs to run from the beginning

if (tableState.isEnabled()) {
  // action 
} else if (tableState.isDisabled) {
  // action
} else {
  // set up suspension timeout/persistence
  setNextState(SnapshotState.SNAPSHOT_PREPARE);
}

@hgromer
Copy link
Contributor Author

hgromer commented Jun 10, 2025

going to proceed with the strategy of simply failing

@Apache-HBase

This comment has been minimized.

Copy link
Contributor

@charlesconnell charlesconnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

Copy link
Contributor

@Apache9 Apache9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When implementing this SnapshotProcedure, we decided to use shared lock and holdLock = true to prevent other table procedure jumps in in the middle of our execution while not hurting the availability because exclusive lock will also prevent region assigning.

In HBASE-28683, we introduced a new way to only allow one procedure to run at the same time for the same table, so maybe a possible way is to make SnapshotProcedure also acquire exclusive lock and set holdLock = false, so it will not be executed at the same time with Enable and Disable procedure.

Thanks.

setNextState(SnapshotState.SNAPSHOT_SNAPSHOT_CLOSED_REGIONS);
} else {
setState(ProcedureState.FAILED);
throw new ProcedureAbortedException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use setFailure and then return Flow.NO_MORE_STATE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thank you for the suggestion

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@hgromer
Copy link
Contributor Author

hgromer commented Jun 11, 2025

When implementing this SnapshotProcedure, we decided to use shared lock and holdLock = true to prevent other table procedure jumps in in the middle of our execution while not hurting the availability because exclusive lock will also prevent region assigning.

In HBASE-28683, we introduced a new way to only allow one procedure to run at the same time for the same table, so maybe a possible way is to make SnapshotProcedure also acquire exclusive lock and set holdLock = false, so it will not be executed at the same time with Enable and Disable procedure.

Thanks.

That's good to know, thank you for the context. The issue here is that the EnableTableProcedure will release the lock after it's subprocedures finish, which allows the SnapshotProcedure to execute on a table that is enabling. The SnapshotProcedure then gets stuck continuously re-running the same state over again, and will refuse to release the lock, creating a deadlock

With all this said, I think it makes the most sense, for simplicity's sake, to fail the procedure if the table is in an invalid state.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache9
Copy link
Contributor

Apache9 commented Jun 13, 2025

When implementing this SnapshotProcedure, we decided to use shared lock and holdLock = true to prevent other table procedure jumps in in the middle of our execution while not hurting the availability because exclusive lock will also prevent region assigning.
In HBASE-28683, we introduced a new way to only allow one procedure to run at the same time for the same table, so maybe a possible way is to make SnapshotProcedure also acquire exclusive lock and set holdLock = false, so it will not be executed at the same time with Enable and Disable procedure.
Thanks.

That's good to know, thank you for the context. The issue here is that the EnableTableProcedure will release the lock after it's subprocedures finish, which allows the SnapshotProcedure to execute on a table that is enabling. The SnapshotProcedure then gets stuck continuously re-running the same state over again, and will refuse to release the lock, creating a deadlock

With all this said, I think it makes the most sense, for simplicity's sake, to fail the procedure if the table is in an invalid state.

Using the mechanism in HBASE-28683, SnapshotProcedure can not be executed together with EnableTableProcedure/DisabledTableProcedure together, which could also solve the problem, and I think it is more stable. I'm not sure whether ModifyTableProcedure could also affect SnapshotProcedure if it jumps in just in the middle of the execution...

@hgromer
Copy link
Contributor Author

hgromer commented Jun 13, 2025

When implementing this SnapshotProcedure, we decided to use shared lock and holdLock = true to prevent other table procedure jumps in in the middle of our execution while not hurting the availability because exclusive lock will also prevent region assigning.
In HBASE-28683, we introduced a new way to only allow one procedure to run at the same time for the same table, so maybe a possible way is to make SnapshotProcedure also acquire exclusive lock and set holdLock = false, so it will not be executed at the same time with Enable and Disable procedure.
Thanks.

That's good to know, thank you for the context. The issue here is that the EnableTableProcedure will release the lock after it's subprocedures finish, which allows the SnapshotProcedure to execute on a table that is enabling. The SnapshotProcedure then gets stuck continuously re-running the same state over again, and will refuse to release the lock, creating a deadlock
With all this said, I think it makes the most sense, for simplicity's sake, to fail the procedure if the table is in an invalid state.

Using the mechanism in HBASE-28683, SnapshotProcedure can not be executed together with EnableTableProcedure/DisabledTableProcedure together, which could also solve the problem, and I think it is more stable. I'm not sure whether ModifyTableProcedure could also affect SnapshotProcedure if it jumps in just in the middle of the execution...

Yes, I'm able to avoid the deadlock if the SnapshotProcedure both sets holdLock = true and also yields after procedure execution. Essentially allowing both SnapshotProcedure and EnableTableProcedure to execute without the need to throw any sort of yield / suspended exception (which was my initial implementation).

This is certainly a cleaner implementation, though it means we can interleave other table procedures after cycles of the SnapshotProcedure.

You hint at this here

I'm not sure whether ModifyTableProcedure could also affect SnapshotProcedure if it jumps in just in the middle of the execution...

and I believe that it would be problematic because you may snapshot regions multiple times, regions may move or go offline, so you may miss out on snapshotting regions.

My initial implementation prevented this by only allowing SNAPSHOT_WRITE_SNAPSHOT_INFO to release the lock (when it's still safe to do so). However, I don't think it's safe to yield the lock after any cycle.

After some thought, I think we still have two solutions here:

  1. My initial implementation, which throws an exception that indicates we should release the lock from within SNAPSHOT_WRITE_SNAPSHOT_INFO
  2. Fail the procedure

It seems the consensus is to avoid complicating the procedure and fail if we encounter the table in an invalid state, so I'm leaning towards 2 thought admittedly I'm a bit torn

@Apache9
Copy link
Contributor

Apache9 commented Jun 14, 2025

This is certainly a cleaner implementation, though it means we can interleave other table procedures after cycles of the SnapshotProcedure.

FWIW, if a SnapshotProcedure starts, no other table procedures which need a exclusive lock can be executed...

And what I mean is that, we can reuse the mechanism introduced in HBASE-28683 to simply fix the problem.

Just change the code here

Make Snapshot also return true, and change SnapshotProcedure's acquireLock method to also require exclusive lock, change holdLock to return false, then we are safe.

Thanks.

@hgromer
Copy link
Contributor Author

hgromer commented Jun 14, 2025

This is certainly a cleaner implementation, though it means we can interleave other table procedures after cycles of the SnapshotProcedure.

FWIW, if a SnapshotProcedure starts, no other table procedures which need a exclusive lock can be executed...

And what I mean is that, we can reuse the mechanism introduced in HBASE-28683 to simply fix the problem.

Just change the code here

Make Snapshot also return true, and change SnapshotProcedure's acquireLock method to also require exclusive lock, change holdLock to return false, then we are safe.

Thanks.

Ah okay, I understand. This does prevent us from being able to take multiple snapshots of the table, as mentioned in this comment but that seems ok

@hgromer hgromer force-pushed the HBASE-29386 branch 2 times, most recently from 570efea to 8adc8bf Compare June 16, 2025 21:18
@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache9
Copy link
Contributor

Apache9 commented Jun 18, 2025

TestSnapshotProcedureRIT.testTableInMergeWhileTakingSnapshot failed, I guess this is related to our changes here?

@hgromer
Copy link
Contributor Author

hgromer commented Jun 18, 2025

TestSnapshotProcedureRIT.testTableInMergeWhileTakingSnapshot failed, I guess this is related to our changes here?

Yes must be related I am taking a look

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 32s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+1 💚 mvninstall 3m 54s master passed
+1 💚 compile 3m 24s master passed
+1 💚 checkstyle 0m 38s master passed
+1 💚 spotbugs 1m 40s master passed
+1 💚 spotless 0m 49s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 7s the patch passed
+1 💚 compile 3m 17s the patch passed
+1 💚 javac 3m 17s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 37s the patch passed
+1 💚 spotbugs 1m 42s the patch passed
+1 💚 hadoopcheck 12m 16s Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚 spotless 0m 45s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 10s The patch does not generate ASF License warnings.
40m 38s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7084/9/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7084
JIRA Issue HBASE-29386
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux 4e14747c481e 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f14a7cf
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 84 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7084/9/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 31s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 3m 53s master passed
+1 💚 compile 0m 58s master passed
+1 💚 javadoc 0m 28s master passed
+1 💚 shadedjars 6m 5s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 11s the patch passed
+1 💚 compile 0m 57s the patch passed
+1 💚 javac 0m 57s the patch passed
+1 💚 javadoc 0m 27s the patch passed
+1 💚 shadedjars 6m 2s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
-1 ❌ unit 214m 55s /patch-unit-hbase-server.txt hbase-server in the patch failed.
242m 39s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7084/9/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7084
JIRA Issue HBASE-29386
Optional Tests javac javadoc unit compile shadedjars
uname Linux 9a40d0fa1370 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f14a7cf
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7084/9/testReport/
Max. process+thread count 5312 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7084/9/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants