Conversation

@jiajunwang
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Resolve #1488

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR enhances the message event processing logic to prevent silent failures and unnecessary retries; a sketch of the resulting decision flow follows the numbered list below.

  1. If creating a message handler fails unexpectedly (i.e., an Exception is thrown), the message is marked as UNPROCESSABLE unless it was sent with a retry count greater than 0. When a retry count is configured, the participant keeps retrying the message callback until the retry count is exhausted.
  2. A message marked UNPROCESSABLE for the reason above is left in the participant's message folder and is not automatically removed. This prevents unnecessary retries.
  3. If the message handler fails because the participant cannot schedule the task, the message is discarded. If it is a state transition message, the corresponding state model and the partition's current state are set to ERROR. This also prevents unnecessary retries.
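For illustration, the following is a minimal, self-contained Java sketch of the decision flow described in the three points above. The types and helper behavior here are simplified stand-ins (printing instead of ZK updates), not the actual Helix Message or HelixTaskExecutor API.

// Minimal sketch of the decision flow described above; the types are
// simplified stand-ins, not the real Helix classes.
public class MessageFailureSketch {

  enum MsgState { NEW, READ, UNPROCESSABLE }

  static class Msg {
    String type;                   // e.g. "STATE_TRANSITION"
    int retryCount;                // retries remaining, as set by the sender
    MsgState state = MsgState.NEW;
  }

  // Called when creating the message handler throws an Exception.
  static void onCreateHandlerFailure(Msg msg, boolean schedulable) {
    if (!schedulable) {
      // Point 3: the task cannot be scheduled; discard the message and, for a
      // state transition message, set the partition's current state to ERROR.
      if ("STATE_TRANSITION".equals(msg.type)) {
        System.out.println("set partition current state to ERROR");
      }
      System.out.println("discard message");
      return;
    }
    if (msg.retryCount > 0) {
      // Point 1: retries remain; keep the message so the next callback
      // retries it, and decrement the remaining count.
      msg.retryCount--;
      System.out.println("retry later, " + msg.retryCount + " retries left");
    } else {
      // Points 1 and 2: no retries left; mark the message UNPROCESSABLE and
      // leave it in the participant's message folder so it is not retried.
      msg.state = MsgState.UNPROCESSABLE;
      System.out.println("marked UNPROCESSABLE, left in message folder");
    }
  }

  public static void main(String[] args) {
    Msg m = new Msg();
    m.type = "STATE_TRANSITION";
    m.retryCount = 0;
    onCreateHandlerFailure(m, true);   // -> marked UNPROCESSABLE
  }
}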

Tests

  • The following tests are written for this issue:

TestHelixTaskExecutor

  • The following is the result of the "mvn test" command on the appropriate module:

[WARNING] Tests run: 1237, Failures: 0, Errors: 0, Skipped: 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:00 h
[INFO] Finished at: 2020-10-29T22:59:19-07:00
[INFO] ------------------------------------------------------------------------

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@kaisun2000
Contributor

If creating a message handler fails unexpectedly (i.e., an Exception is thrown), the message is marked as UNPROCESSABLE unless it was sent with a retry count greater than 0. When a retry count is configured, the participant keeps retrying the message callback until the retry count is exhausted.

If the retry count is large (say someone puts -1 instead of 0), this is effectively retrying forever. How can we make sure this does not happen?

@kaisun2000
Contributor

kaisun2000 commented Nov 4, 2020

A message marked UNPROCESSABLE for the reason above is left in the participant's message folder and is not automatically removed. This prevents unnecessary retries.

Let's say the participant later resolves the issue and can create the messageHandler for this partition replica. Does that mean that, when it recovers, this UNPROCESSABLE message has to be removed manually (or via REST)? If so, do we have such a REST API?

@jiajunwang
Contributor Author

If the retry count is large (say someone puts -1 instead of 0), this is effectively retrying forever. How can we make sure this does not happen?

@kaisun2000 What do you mean? If it is -1, then the current logic will try once and stop retrying. And following what Junkai mentioned, I would like to check the count even before the first try. So that should not be the case, right?

Let's say the participant later resolves the issue and can create the messageHandler for this partition replica. Does that mean that, when it recovers, this UNPROCESSABLE message has to be removed manually (or via REST)? If so, do we have such a REST API?

This is the plan. We have discussed this in the previous standup meeting. That task will be addressed separately. Before it is ready, we will require the application to reset the participant.

@kaisun2000
Contributor

If the retry count is large (say someone puts -1 instead of 0), this is effectively retrying forever. How can we make sure this does not happen?

@kaisun2000 What do you mean? If it is -1, then the current logic will try once and stop retrying. And following what Junkai mentioned, I would like to check the count even before the first try. So that should not be the case, right?

Let us say the message has a retry count of INT_MAX; this is not going to work. I think we can be a little more conservative: if the message has a retry count larger than a threshold, we just retry up to the threshold value.

Let's say the participant later resolves the issue and can create the messageHandler for this partition replica. Does that mean that, when it recovers, this UNPROCESSABLE message has to be removed manually (or via REST)? If so, do we have such a REST API?

This is the plan. We have discussed this in the previous standup meeting. That task will be addressed separately. Before it is ready, we will require the application to reset the participant.

@jiajunwang
Contributor Author

Let us say the message has a retry count of INT_MAX; this is not going to work. I think we can be a little more conservative: if the message has a retry count larger than a threshold, we just retry up to the threshold value.

@kaisun2000, if the user explicitly sets the retry count to be effectively infinite, we should not block it, right? Note that the retry won't cause any Helix controller issue, since the message is only retried on the participant side. For the ZK servers, yes, it could potentially lead to many ZK writes, but I think that is a problem at a different level, ZK write throttling for example.
In addition, compared with the current behavior, even a retry count of INT_MAX works better, since every retry in the new logic generates only one write. The old logic removes the message and creates a new one, so at least 2 write IOs per retry.

@kaisun2000
Contributor

kaisun2000 commented Nov 4, 2020

Let us say the message has a retry count of INT_MAX; this is not going to work. I think we can be a little more conservative: if the message has a retry count larger than a threshold, we just retry up to the threshold value.

@kaisun2000, if the user explicitly sets the retry count to be effectively infinite, we should not block it, right? Note that the retry won't cause any Helix controller issue, since the message is only retried on the participant side. For the ZK servers, yes, it could potentially lead to many ZK writes, but I think that is a problem at a different level, ZK write throttling for example.
In addition, compared with the current behavior, even a retry count of INT_MAX works better, since every retry in the new logic generates only one write. The old logic removes the message and creates a new one, so at least 2 write IOs per retry.

#1487 only makes it not log to ZK by default; it can still log to ZK with some configuration. I think capping the retry count at some fixed value, say 100, would be safer. Just my 2 cents. Up to you.
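For illustration only, here is a minimal Java sketch of the cap suggested above; MAX_RETRY_CAP and the clamping behavior are assumptions for the example, not something this PR adds.

// Illustrative only: clamp a sender-provided retry count to a fixed cap,
// as suggested above. MAX_RETRY_CAP is a hypothetical constant.
public class RetryCapSketch {
  static final int MAX_RETRY_CAP = 100;

  static int effectiveRetryCount(int requestedRetryCount) {
    // Treat negative values as "no retry" and clamp large values to the cap.
    if (requestedRetryCount < 0) {
      return 0;
    }
    return Math.min(requestedRetryCount, MAX_RETRY_CAP);
  }

  public static void main(String[] args) {
    System.out.println(effectiveRetryCount(-1));                 // 0
    System.out.println(effectiveRetryCount(5));                  // 5
    System.out.println(effectiveRetryCount(Integer.MAX_VALUE));  // 100
  }
}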

Contributor

@junkaixue left a comment

Overall, LGTM. Good PR for cleaning up.

@jiajunwang
Contributor Author

Approved by @dasahcc. I have added the comment as suggested and will merge the PR shortly.

@jiajunwang merged commit f11396e into apache:master Nov 5, 2020
@jiajunwang deleted the instance branch November 5, 2020 17:19

Development

Successfully merging this pull request may close these issues.

Participant side message handling logic may fail silently
