exit main thread with same exit code as payment_producer #713

Open

nicolasochem
Contributor

#679 introduced support for exit codes, so an alert can be sent in single-shot mode when payouts fail for any reason.

However, it was crude, only supporting exit code 1.

The producer thread supports many exit codes.

In this case, there is a benign issue where tzkt returns "not synced" and payouts therefore fail. This is likely temporary and will pass on the next try, so there is no need to alert. Currently, however, it is not possible to behave differently based on the exit code, because it is always 0 or 1.

An ugly solution is to save the exit code of the child thread in a file, then read it in the main thread. That's what I am doing here. I remain convinced that the entire thread architecture needs to go away, and that we need to make TRD single-threaded again, but that's for another day.

Also:

  • change the exit code of a misconfigured provider to GENERAL_ERROR, because it's not really a provider error,
  • change the help to remove old providers that we no longer support.
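
The file-based hand-off described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the file location, the helper names `write_exit_code`, `read_exit_code` and `run`, and the fallback value are all assumptions made for the example.

```python
import os
import tempfile
import threading

# Hypothetical path: the PR's real EXIT_CODE_FILE value is not shown here.
EXIT_CODE_FILE = os.path.join(tempfile.gettempdir(), "trd_exit_code_demo")
GENERAL_ERROR = 1  # assumed fallback, mirroring the old "always 1" behaviour


def write_exit_code(code: int) -> None:
    """Called from the child thread just before it stops."""
    with open(EXIT_CODE_FILE, "w") as f:
        f.write(str(code))


def read_exit_code() -> int:
    """Called from the main thread; falls back to GENERAL_ERROR."""
    try:
        with open(EXIT_CODE_FILE) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return GENERAL_ERROR
    finally:
        # Remove the file so a stale code never leaks into the next run.
        if os.path.exists(EXIT_CODE_FILE):
            os.remove(EXIT_CODE_FILE)


def producer() -> None:
    # Stand-in for the producer thread failing with a specific code.
    write_exit_code(8)


def run() -> int:
    """Start the child thread, wait for it, and return the code it left behind."""
    worker = threading.Thread(target=producer, name="producer")
    worker.start()
    worker.join()
    return read_exit_code()
```

An entry point would then finish with `sys.exit(run())`, so the process exit code matches whatever the child thread recorded.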


IMPORTANT NOTICE:
I read and understood the guidelines for contributions to the TRD. The contribution may qualify for being compensated by the TRD grant if approved by the maintainers.

This PR resolves the issue . The following steps were performed:

  • Analysis: If the described issue is a bug report, analyze the reasons resulting in this bug.

  • Solution: Describe the proposed solution for the bug or feature.

  • Implementation: Rough description/explanation of the implementation choices.

  • Performed tests: Describe the performed tests.

  • Documentation: Make sure to document the added changes in a proper way (Readme, help section, documentation, comments in code if needed)

  • Check list:

  • I extended the Github Actions CI test units with the corresponding tests for this new feature (if needed).
  • I extended the Sphinx documentation (if needed).
  • I extended the help section (if needed).
  • I changed the README file (if needed).
  • I created a new issue if there is further work left to be done (if needed).

Work effort: Give your estimate of the work effort in hours. This might be adjusted or discussed by the other contributors in order to keep a fair rewarding process for the efforts.

@TPXP
Contributor

TPXP commented Oct 2, 2024

Writing the exit code to a temporary file is a bit odd. Can't we use a singleton Python module to store this data?

Here's a basic implementation:

# exit_code.py
import logging

logger = logging.getLogger(__name__)

exitCode = -1
exitOrigin = ""

def set_exit_code(origin: str, code: int):
  # Alternatively, use inspect.stack to get the caller details https://stackoverflow.com/questions/1095543/get-name-of-calling-functions-module-in-python
  # Without this declaration the assignments below would create local variables
  # and the read of exitOrigin would raise UnboundLocalError.
  global exitCode, exitOrigin
  if exitOrigin == "":
    exitCode = code
    exitOrigin = origin
  else:
    logger.warning(f"{origin} tried to set exit code to {code} while it had already been set by {exitOrigin} to {exitCode}. Ignoring")

def get_exit_code() -> int:
  return exitCode

Usage:

from exit_code import set_exit_code

# ... function code ...
set_exit_code("module.my_function", 3)
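
To show how the main thread would pick the value up, here is a small sketch in the same spirit, with a lock added so that concurrent producer and consumer threads cannot race on the first write. The names `_lock`, `run_producer` and the `default` parameter are made up for this example, and the code 8 is just a placeholder value.

```python
import threading

_lock = threading.Lock()
_exit_code = None    # None means "not set yet"
_exit_origin = None


def set_exit_code(origin: str, code: int) -> bool:
    """Record the first exit code reported; return False if one was already set."""
    global _exit_code, _exit_origin
    with _lock:
        if _exit_code is not None:
            return False
        _exit_code, _exit_origin = code, origin
        return True


def get_exit_code(default: int = 0) -> int:
    """Return the recorded code, or `default` if no thread reported one."""
    with _lock:
        return default if _exit_code is None else _exit_code


def run_producer() -> int:
    # Stand-in for the producer thread: it only records its exit code.
    worker = threading.Thread(
        target=set_exit_code, args=("producer", 8), name="producer"
    )
    worker.start()
    worker.join()
    # Module state is shared between threads, so no file is needed.
    return get_exit_code()
```

The main thread could then call `sys.exit(run_producer())`. Since all threads live in one process, module-level state is enough to carry the code across.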

nicolasochem added a commit to tacoinfra/tezos-k8s that referenced this pull request Oct 4, 2024
The TRD chart runs tezos reward distribution software for delegations.
We support sending Slack alerts when reward distribution fails. Here, we
address a common false positive:

Provider errors are usually because tzkt is rate limited or busy.

example:

2024-09-29 21:02:16,534 - MainThread - INFO - --------------------------------------------
2024-09-29 21:02:16,535 - MainThread - INFO - BAKING ADDRESS is
2024-09-29 21:02:16,535 - MainThread - INFO - PAYMENT ADDRESS is
2024-09-29 21:02:16,535 - MainThread - INFO - --------------------------------------------
2024-09-29 21:02:16,537 - MainThread - INFO - [Plugins] No plugins enabled
2024-09-29 21:02:16,539 - MainThread - INFO - Initial cycle set to -1
2024-09-29 21:02:16,542 - MainThread - INFO - Application is READY!
2024-09-29 21:02:16,544 - producer  - INFO - No failed payment files found under directory '/trd/reports/xxx/payments/failed' on or after cycle '-1'
2024-09-29 21:02:16,545 - MainThread - INFO - --------------------------------------------
2024-09-29 21:02:16,624 - producer  - ERROR - Unable to fetch current cycle from provider tzkt, Not synced. Exiting.
2024-09-29 21:02:16,626 - consumer0 - WARNING - Exit signal received. Terminating...
2024-09-29 21:02:16,626 - MainThread - INFO - Application stop handler called: 12
2024-09-29 21:02:16,628 - producer  - INFO - TRD Exit triggered by producer, exit code: 8
2024-09-29 21:02:16,629 - MainThread - INFO - TRD is shutting down...
2024-09-29 21:02:16,630 - MainThread - INFO - --------------------------------------------------------
2024-09-29 21:02:16,631 - MainThread - INFO - Sensitive operations are in progress!
2024-09-29 21:02:16,631 - MainThread - INFO - Please wait while the application is being shut down!
2024-09-29 21:02:16,632 - MainThread - INFO - --------------------------------------------------------
2024-09-29 21:02:16,632 - MainThread - INFO - Lock file removed!
2024-09-29 21:02:16,633 - MainThread - INFO - Shutdown due to error!, exit code: 1
Tezos Reward Distributor (TRD) is Starting

We also modify TRD to add a specific error code for this specific benign
case:
tezos-reward-distributor-organization/tezos-reward-distributor#713
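
As a rough illustration of what "behaving differently based on the exit code" could look like on the alerting side, here is a minimal sketch. It is not the chart's actual logic: `BENIGN_EXIT_CODES`, `should_alert` and `run_and_check` are hypothetical names, and treating 8 as the "provider not synced" code is an assumption taken from the producer line in the log above.

```python
import subprocess
import sys
from typing import Sequence

# Assumption: 8 is what the producer reports for the tzkt "Not synced" case,
# as in the log above. Check TRD's list of exit codes before relying on it.
BENIGN_EXIT_CODES = {8}


def should_alert(exit_code: int) -> bool:
    """Alert on any failure except the ones known to be transient."""
    return exit_code != 0 and exit_code not in BENIGN_EXIT_CODES


def run_and_check(cmd: Sequence[str]) -> bool:
    """Run a command and report whether its exit code warrants an alert."""
    result = subprocess.run(list(cmd))
    return should_alert(result.returncode)
```

With the main thread propagating the producer's code, a wrapper like this can stay silent on a transient provider error and still alert on everything else.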
@@ -394,6 +395,27 @@ def producer_exit_handler(self, signum, frame):

    def shut_down_on_error(self):
        self.fsm.trigger_event(TrdEvent.SHUT_DOWN_ON_ERROR)
        exit_code = ExitCode.GENERAL_ERROR
        if os.path.exists(EXIT_CODE_FILE):
            logger.info("NOCHEM exit code file exists")


remove NOCHEM ?

nicolasochem added a commit to tacoinfra/tezos-k8s that referenced this pull request Oct 4, 2024
* TRD ignore provider error - do not send alert

* add link to list of exit codes