-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-24877 Add option to avoid aborting RS process upon uncaught exc… #2255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…eptions happen on replication source
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments
...server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
Outdated
Show resolved
Hide resolved
...server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
Outdated
Show resolved
Hide resolved
Thread.currentThread().getName() + ".replicationSource," + this.queueId, | ||
(t,e) -> { | ||
uncaughtException(t, e); | ||
retryStartup.set(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope if we encounter uncaughtException here, we want to retry the loop again.
If so, shall we also add startupOngoing.set(true);
here explicitely? Just in case if it is set to false in initialize()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope if we encounter uncaughtException here, we want to retry the loop again.
Yes, we want to keep trying until we succeed.
If so, shall we also add startupOngoing.set(true); here explicitely? Just in case if it is set to false in initialize()?
The only cases startupOngoing
would had been set to false
in initialize
is if it completes fine without any uncaught exception, so adding it here would be redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah right, it is not set to false
if it completes successfully, anyone touching the same code in future should realize this.
Although not a strong point but If you don't mind, maybe we can comment here indicating startupOngoing
is expected to be true when we are here handling Exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the nits and added some comments to the two flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's next if we ignore the exception? We will retry later? Or we will just go on without this replication source? Users will then find out that the cluster is fine but data has not been replicated out? I'm not sure if this is correct way, we fix an issue but introduce another hard to find issue?
Adding a flag can keep the old behavior but we give users an impression that the exception can be ignored? Still not sure if this is the correct way to fix this...
Mind explaining more on your real usage?
Thanks.
As you can see on
It's common practice to verify replication status after a maintenance.
It does not fail silently, errors will get logged, and it gives operators the chance to look after what's going wrong without a complete downtime of their source clusters.
We do use some custom replication endpoints that under certain unavailability of some target peer hosts ended up throwing uncaught exception and aborting the source RSes. Sure, there could be improvements on the custom code, and it was an internal infra issue, but with a flag like this, we wouldn't need to face a period of outage at the source. |
OK, thank you for your reply. Let me take a look at the code more carefully. |
🎊 +1 overall
This message was automatically generated. |
@@ -373,7 +389,21 @@ private void tryStartNewShipper(String walGroupId, PriorityBlockingQueue<Path> q | |||
Threads.setDaemonThreadRunning( | |||
walReader, Thread.currentThread().getName() | |||
+ ".replicationSource.wal-reader." + walGroupId + "," + queueId, | |||
this::uncaughtException); | |||
(t,e) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here it is for wal reader. I think refreshSources and retry is an acceptable way. Then let's just test the abortOnError flag here? If it is true, we will call uncaughtException, otherwise we will try to refresh the replication source.
@@ -290,7 +290,22 @@ private boolean updateLogPosition(WALEntryBatch batch) { | |||
public void startup(UncaughtExceptionHandler handler) { | |||
String name = Thread.currentThread().getName(); | |||
Threads.setDaemonThreadRunning(this, | |||
name + ".replicationSource.shipper" + walGroupId + "," + source.getQueueId(), handler); | |||
name + ".replicationSource.shipper" + walGroupId + "," + source.getQueueId(), | |||
(t,e) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the code is almost the same... Then I think we could move the logic into uncaughtException method? If abortOnError is true, we about, otherwise we will try to refresh the source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it to uncaughtException. This is true when handling potential errors on both the reader and shipper threads, but uncaughtException is also used when handling issues from ReplicationSource.initialize method, which might blow before reader/shipper is even started. So needed to add extra check in uncaughtException to decide when to invoke refreshSources.
Threads.setDaemonThreadRunning(initThread, | ||
Thread.currentThread().getName() + ".replicationSource," + this.queueId, | ||
this::uncaughtException); | ||
this.retryStartup.set(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag is only used in this method? Let's use a local var instead of a class member field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, originally I was also referring it on uncaughtException, but it was not actually needed.
...server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
Show resolved
Hide resolved
Thanks for the suggestions, @Apache9, had a pushed a new commit addressing those, let me know on your thoughts. |
🎊 +1 overall
This message was automatically generated. |
// an initialization happening (but not finished yet), | ||
// so that it doesn't try submit another initialize thread. | ||
// NOTE: this should only be set to false at the end of initialize method, prior to return. | ||
// private final AtomicBoolean startupOngoing = new AtomicBoolean(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent?
if(abortOnError){ | ||
server.abort("Unexpected exception in " + t.getName(), e); | ||
} | ||
if(manager!=null){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to update the formatter config? Usually it should be 'if (manager != null) {'.
this::uncaughtException); | ||
//Flag that signalizes uncaught error happening while starting up the source | ||
// and a retry should be attempted | ||
AtomicBoolean retryStartup = new AtomicBoolean(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need a AtomicBoolean here? It is only used locally, so a simple boolean is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been modified by the lambda expression on line #635, since lambdas expressions require variables to be final, I can't use a simple, local primitive boolean or wrapper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then use MutableBoolean? We do not need to use atomic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for suggesting MutableBoolean, am not too familiar with apache commons lib.
retryStartup.set(true); | ||
}); | ||
} | ||
} while (!this.sourceRunning); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a second thought on this here, we can't simply re-use this boolean, because in case of failure, we risk reach this point before the exception handler has updated it to false. I'm bringing back the original startupOngoing in the next commit,
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Pushed a new commit addressing latest suggestion and checkstyle issues. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
retest build |
1 similar comment
retest build |
This does not work, you need to go to the jenkins page to manually start a build, or just do a rebase and force push here, if you want to trigger a new build. |
🎊 +1 overall
This message was automatically generated. |
It looks like this is causing some of the UTs to timeout. Let me dig into it further. |
…mplete initialization but don't care about it
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Latest UT failure seems unrelated, have it passing locally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me. Seems like Duo has given this a thorough investigation :)
…eptions happen on replication source