Conversation

@jiajunwang
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Resolve #1488

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR enhances the message event processing logic to prevent silent failures and unnecessary retries; a sketch of the resulting decision flow follows the numbered list below.

  1. If creating a message handler fails unexpectedly (i.e., an Exception is thrown), the message is marked as UNPROCESSABLE unless it was sent with a retry count greater than 0. When a retry count is configured, the participant keeps retrying the message callback until the retry count is exhausted.
  2. A message marked UNPROCESSABLE for the reason above is left in the participant's message folder and is not automatically removed. This prevents unnecessary retries.
  3. If the message handler fails because the participant cannot schedule the task, the message is discarded. If it is a state transition message, the corresponding state model and the partition's current state are set to ERROR. This also prevents unnecessary retries.
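For illustration, the following is a minimal, self-contained Java sketch of the decision flow described in the three points above. The types and helper behavior here are simplified stand-ins (printing instead of ZK updates), not the actual Helix Message or HelixTaskExecutor API.

// Minimal sketch of the decision flow described above; the types are
// simplified stand-ins, not the real Helix classes.
public class MessageFailureSketch {

  enum MsgState { NEW, READ, UNPROCESSABLE }

  static class Msg {
    String type;                   // e.g. "STATE_TRANSITION"
    int retryCount;                // retries remaining, as set by the sender
    MsgState state = MsgState.NEW;
  }

  // Called when creating the message handler throws an Exception.
  static void onCreateHandlerFailure(Msg msg, boolean schedulable) {
    if (!schedulable) {
      // Point 3: the task cannot be scheduled; discard the message and, for a
      // state transition message, set the partition's current state to ERROR.
      if ("STATE_TRANSITION".equals(msg.type)) {
        System.out.println("set partition current state to ERROR");
      }
      System.out.println("discard message");
      return;
    }
    if (msg.retryCount > 0) {
      // Point 1: retries remain; keep the message so the next callback
      // retries it, and decrement the remaining count.
      msg.retryCount--;
      System.out.println("retry later, " + msg.retryCount + " retries left");
    } else {
      // Points 1 and 2: no retries left; mark the message UNPROCESSABLE and
      // leave it in the participant's message folder so it is not retried.
      msg.state = MsgState.UNPROCESSABLE;
      System.out.println("marked UNPROCESSABLE, left in message folder");
    }
  }

  public static void main(String[] args) {
    Msg m = new Msg();
    m.type = "STATE_TRANSITION";
    m.retryCount = 0;
    onCreateHandlerFailure(m, true);   // -> marked UNPROCESSABLE
  }
}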

Tests

  • The following tests are written for this issue:

TestHelixTaskExecutor

  • The following is the result of the "mvn test" command on the appropriate module:

[WARNING] Tests run: 1237, Failures: 0, Errors: 0, Skipped: 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:00 h
[INFO] Finished at: 2020-10-29T22:59:19-07:00
[INFO] ------------------------------------------------------------------------

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@kaisun2000
Contributor

If creating a message handler fails unexpectedly (i.e., an Exception is thrown), the message is marked as UNPROCESSABLE unless it was sent with a retry count greater than 0. When a retry count is configured, the participant keeps retrying the message callback until the retry count is exhausted.

If the retry count is large (say someone puts -1 instead of 0), this is effectively retrying forever. How can we make sure this does not happen?

@kaisun2000
Contributor

kaisun2000 commented Nov 4, 2020

A message marked UNPROCESSABLE for the reason above is left in the participant's message folder and is not automatically removed. This prevents unnecessary retries.

Let's say the participant later resolves the issue and can create the messageHandler for this partition replica. Does that mean that, when it recovers, this UNPROCESSABLE message has to be removed manually (or via REST)? If so, do we have such a REST API?

@jiajunwang
Contributor Author

If the retry count is large (say someone puts -1 instead of 0), this is effectively retrying forever. How can we make sure this does not happen?

@kaisun2000 What do you mean? If it is -1, then the current logic will try once and stop retrying. And following what Junkai mentioned, I would like to check the count even before the first try. So that should not be the case, right?

Let's say the participant later resolves the issue and can create the messageHandler for this partition replica. Does that mean that, when it recovers, this UNPROCESSABLE message has to be removed manually (or via REST)? If so, do we have such a REST API?

This is the plan. We have discussed this in the previous standup meeting. That task will be addressed separately. Before it is ready, we will require the application to reset the participant.

@kaisun2000
Contributor

If the retry count is large (say someone puts -1 instead of 0), this is effectively retrying forever. How can we make sure this does not happen?

@kaisun2000 What do you mean? If it is -1, then the current logic will try once and stop retrying. And following what Junkai mentioned, I would like to check the count even before the first try. So that should not be the case, right?

Let us say the message has a retry count of INT_MAX; this is not going to work. I think we can be a little more conservative: if the message has a retry count larger than a threshold, we just retry up to the threshold value.

Let's say the participant later resolves the issue and can create the messageHandler for this partition replica. Does that mean that, when it recovers, this UNPROCESSABLE message has to be removed manually (or via REST)? If so, do we have such a REST API?

This is the plan. We have discussed this in the previous standup meeting. That task will be addressed separately. Before it is ready, we will require the application to reset the participant.

@jiajunwang
Contributor Author

Let us say the message has a retry count of INT_MAX; this is not going to work. I think we can be a little more conservative: if the message has a retry count larger than a threshold, we just retry up to the threshold value.

@kaisun2000, if the user explicitly sets the retry count to be effectively infinite, we should not block it, right? Note that the retry won't cause any Helix controller issue, since the message is only retried on the participant side. For the ZK servers, yes, it could potentially lead to many ZK writes, but I think that is a problem at a different level, ZK write throttling for example.
In addition, compared with the current behavior, even a retry count of INT_MAX works better, since every retry in the new logic generates only one write. The old logic removes the message and creates a new one, so at least 2 write IOs per retry.

@kaisun2000
Contributor

kaisun2000 commented Nov 4, 2020

Let us say the message has a retry count of INT_MAX; this is not going to work. I think we can be a little more conservative: if the message has a retry count larger than a threshold, we just retry up to the threshold value.

@kaisun2000, if the user explicitly sets the retry count to be effectively infinite, we should not block it, right? Note that the retry won't cause any Helix controller issue, since the message is only retried on the participant side. For the ZK servers, yes, it could potentially lead to many ZK writes, but I think that is a problem at a different level, ZK write throttling for example.
In addition, compared with the current behavior, even a retry count of INT_MAX works better, since every retry in the new logic generates only one write. The old logic removes the message and creates a new one, so at least 2 write IOs per retry.

#1487 only makes it not log to ZK by default; it can still log to ZK with some configuration. I think capping the retry count at some fixed value, say 100, would be safer. Just my 2 cents. Up to you.
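For illustration only, here is a minimal Java sketch of the cap suggested above; MAX_RETRY_CAP and the clamping behavior are assumptions for the example, not something this PR adds.

// Illustrative only: clamp a sender-provided retry count to a fixed cap,
// as suggested above. MAX_RETRY_CAP is a hypothetical constant.
public class RetryCapSketch {
  static final int MAX_RETRY_CAP = 100;

  static int effectiveRetryCount(int requestedRetryCount) {
    // Treat negative values as "no retry" and clamp large values to the cap.
    if (requestedRetryCount < 0) {
      return 0;
    }
    return Math.min(requestedRetryCount, MAX_RETRY_CAP);
  }

  public static void main(String[] args) {
    System.out.println(effectiveRetryCount(-1));                 // 0
    System.out.println(effectiveRetryCount(5));                  // 5
    System.out.println(effectiveRetryCount(Integer.MAX_VALUE));  // 100
  }
}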

Contributor

@junkaixue left a comment

Overall, LGTM. Good PR for cleaning up.

@jiajunwang
Contributor Author

Approved by @dasahcc. I have added the comment as suggested and will merge the PR shortly.

@jiajunwang merged commit f11396e into apache:master Nov 5, 2020
@jiajunwang deleted the instance branch November 5, 2020 17:19

Development

Successfully merging this pull request may close these issues.

Participant side message handling logic may fail silently
