feat(server): Asynchronous server-side background task execution #4317

Open · whisperity wants to merge 1 commit into master from feat/server/asynchronous-store-protocol/patch/2-task-management
Conversation

@whisperity (Contributor) commented on Aug 17, 2024

Important

This is patch 1 of the Asynchronous Store Protocol (#3672).

This patch implements the whole support ecosystem for server-side background tasks, in order to lessen the load on, and the potentially detrimental blocking of, API handler processes that long-running operations in the web server, such as massStoreRun(), would otherwise occupy.

API workers vs. background/task workers

In this patch, the processing model of the CodeChecker server is extended with the concept of background workers. These special processes consume tasks from the server's queue and execute them. The previously existing clone processes of a server are henceforth termed API workers, and they do the same as before: respond to Thrift RPC calls.

The server will start background_worker_processes background processes; by default, this is the same as worker_processes, the number of API handlers (which itself defaults to $(nproc)).

Self-healing against dead children

This patch also implements a fix that resulted from reviewing and reworking both the threading and the signalling model of a CodeChecker server's process tree. Previously, an exception escaping from an API worker process (or the process getting OOM-killed, etc.) would cause the initially large pool of workers to die off one by one, leaving the server with a monotonically decreasing number of workers to dispatch RPC requests to. With this patch, the server handles SIGCHLD and respawns dead child processes (both API and task workers) when given the opportunity.

Graceful server shutdown

The refactoring of the server's life cycle and signal handling logic also includes changes that make the server process(es) terminate more gracefully, which is needed to accurately clean up pending tasks (see later).

In case a brutal and unclean server shutdown is needed, following this patch, the SIGINT or SIGTERM signal must be sent twice to the server.
(Alternatively, just use SIGKILL! 😉)

Task management

A Task is represented by two things in strict co-existence: a lightweight, pickle-able implementation of the task's execution logic (subclassed from AbstractTask and overriding _implementation()), and a corresponding BackgroundTask entity in the database. These records reside in the CONFIGURATION database, so task information is server-wide (service-wide), shared across all products.
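As a rough sketch (the import path, constructor signature, and field names below are assumptions for illustration, not the patch's exact API), a task implementation is a small class of roughly this shape:

# Sketch only: the module path and base-class constructor are assumed.
from codechecker_server.task_executors.abstract_task import AbstractTask


class ExampleCleanupTask(AbstractTask):
    """Hypothetical task: delete stale files under a directory."""

    def __init__(self, token, data_path, directory):
        super().__init__(token, data_path)   # assumed signature
        self.directory = directory           # keep state small and pickle-able

    def _implementation(self, task_manager) -> None:
        # Runs later, inside a background worker process.
        for stale_file in sorted(self.directory.glob("*.tmp")):
            stale_file.unlink()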

The database-synchronised record contains several pieces of human-readable metadata about the task, such as a one-line summary field and the larger, log-like comments column, as well as timestamps recorded at the crucial moments of the task's execution.

The most important flag is a task's status, which can be: ALLOCATED, ENQUEUED, RUNNING, COMPLETED, FAILED, CANCELLED, and DROPPED.
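For reference, the status set described above written out as a Python enum (the actual Thrift/database representation in the patch may differ):

from enum import Enum, auto


class TaskStatus(Enum):
    """Status values a background task can take, per the description above."""
    ALLOCATED = auto()  # token and database record created, not yet queued
    ENQUEUED = auto()   # pushed onto the in-memory task queue
    RUNNING = auto()    # picked up by a background worker
    COMPLETED = auto()  # _implementation() returned without error
    FAILED = auto()     # _implementation() raised an exception
    CANCELLED = auto()  # task honoured a SUPERUSER's cancel request
    DROPPED = auto()    # task abandoned due to server shutdown or death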

Normal path

The life cycle of a task is as follows:

  1. A Task is created as the result of an API request (executed in the API worker process).
    a. (Currently, for the sake of both testing and demonstration, only the createDummyTask() API, and only in a test context, can create a task.)
  2. A unique identifier, termed a task token, is ALLOCATED, together with the BackgroundTask database entity.
  3. The API worker pushes the Task object into a shared, synchronised queue within the server's memory space.
    At this point, the task is considered ENQUEUED.
    a. AbstractTask subclasses MUST be pickle-able and reasonably small.
    b. The library offers means to store additional large data on the file system, in a temporary directory specific to the task.
  4. The task token is returned to the user via the RPC API call, and the API worker is free to respond to other requests.
  5. In a loop with some frequency, the user (if they wish to receive this information) exercises the getTaskInfo() API (executed in the context of any API worker process, synchronised over the database) to query whether the task has completed; see the polling sketch after this list.
  6. A background worker process (which is a separate context, but executed in the same process tree as the server's API requests — there is NO publishing of tasks between machines!) wakes up to the enqueuement, and pops the Task object from the queue.
    After some bookkeeping, the task will be RUNNING.
  7. MyTaskClass::_implementation() is called, which executes the task's primary business logic.
  8. If _implementation() returns without a failure, the task will be considered successfully COMPLETED.
    Any exception escaping from the method will set the task to FAILED, and exception information is logged into the BackgroundTask.comments column of the database.
    Together, these two are the "normal termination states".
  9. Eventually, the user (if waiting for this information) will learn that the task has completed.
  10. The background worker process goes back to sleep until a new Task is available from the queue.
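The polling mentioned in step 5 could look roughly like the sketch below; the client object and the attribute names on the returned task info are assumptions for illustration only:

import time


def wait_for_task(client, token: str, poll_interval: float = 5.0):
    """Poll getTaskInfo() until the task reaches a termination state.

    'client' is assumed to be an authenticated Thrift client of the tasks
    API; the attribute names on the returned info object are illustrative.
    """
    terminated = {"COMPLETED", "FAILED", "CANCELLED", "DROPPED"}
    while True:
        info = client.getTaskInfo(token)
        if info.status in terminated:   # status representation assumed to be a name
            return info
        time.sleep(poll_interval)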

Abnormal path 1: admin cancellation

At any point following ALLOCATED status, but most likely in the ENQUEUED and RUNNING statuses, a SUPERUSER may issue a cancelTask() order.
This will set BackgroundTask.cancel_flag, and the task is expected (although not required!) to poll its own should_cancel() status at internal checkpoints and terminate gracefully in response to this request. This is done by _implementation() exiting via a raised TaskCancelHonoured exception.
(If the task does not raise one, it will be allowed to conclude normally, or fail in some other manner.)
Tasks cancelled gracefully will have the CANCELLED status.

For example, a background task that performs an action over a set of input files generally should be implemented like this:

def _implementation(self, tm: TaskManager) -> None:
    for file in INPUTS:
        # Check the cancellation flag at a safe checkpoint.
        if tm.should_cancel(self):
            ROLLBACK()
            raise TaskCancelHonoured(self)

        DO_LOGIC(file)

Abnormal path 2: server shutdown

At any point in this life cycle, the server might receive the command to terminate itself (the SIGINT or SIGTERM kill signals, e.g., as caused by CodeChecker server --stop). Following the termination of the API workers, the background workers will also shut down one by one.
At this point, the default behaviour is to raise a special cancel event which tasks currently RUNNING may still gracefully honour, as if it were a SUPERUSER's single-task cancel request. All other tasks that have not started executing yet and are in the ALLOCATED or ENQUEUED status will never start.

All tasks not in a normal termination state will be set to the DROPPED status, with the comments field containing a log of the state in which the task was dropped, and why. (Together, CANCELLED and DROPPED are the "abnormal termination states", indicating that the task terminated due to some external influence.)

Task querying

The getTaskInfo() API, querying the status of one task, is available to the user who caused the task to spawn, the PRODUCT_ADMINs of the product associated with the task (if any), and SUPERUSERs.

The getTasks() API, which queries multiple tasks based on a filter set, is available only to PRODUCT_ADMINs (with results restricted to the products they administer) and SUPERUSERs (unrestricted).
Just about any information that is stored about a task in the database can be queried.

--machine-id

Unfortunately, servers do not always terminate gracefully (cue the aforementioned SIGKILL, but the container, VM, or host machine could also simply die during execution, in ways the server is unable to handle). Because tasks are not shared across server processes, and crucial bits of information needed to execute a task live only in the now-dead process's memory, a server later restarting in place of a previously dead one should be able to identify which tasks its "predecessor" left behind without clean-up.

This is achieved by storing the running computer's identifier, configurable via CodeChecker server --machine-id, as an additional piece of information for each task. By default, the machine ID is constructed from gethostname():portnumber, e.g., cc-server:80.
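That is, the default identifier is equivalent to roughly the following (the exact formatting in the patch may differ):

import socket


def default_machine_id(port: int) -> str:
    """Build the default machine identifier, e.g. "cc-server:80"."""
    return f"{socket.gethostname()}:{port}"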

In containerised environments, relying on gethostname() may not be entirely stable!
For example, Docker exposes the first 12 characters of the container's unique ID hash as the "hostname" inside the container. If the container is started with --restart always or --restart unless-stopped, this is fine; however, more advanced systems, such as Docker Swarm, will create a new container if the old one dies (!), resulting in a new gethostname() value.

In such environments, service administrators must take additional care and configure their instances by setting --machine-id accordingly for subsequent executions of the "same" server. If a server with machine ID M starts up (usually after a container or "system" restart), it will set every task associated with machine ID M that is not in a termination state to the DROPPED status (with an appropriately formatted comment attached), signifying that the previous instance "dropped" these tasks but had no chance of recording this fact.
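Conceptually, this start-up clean-up amounts to something like the sketch below, assuming a SQLAlchemy DBTask model with machine_id, status, and comments columns (the column names, status representation, and helper shape are assumptions; the actual schema in the patch may differ):

from datetime import datetime, timezone

TERMINATION_STATES = ("COMPLETED", "FAILED", "CANCELLED", "DROPPED")


def drop_abandoned_tasks(session, DBTask, machine_id: str) -> None:
    """Mark tasks left behind by a dead predecessor with the same machine ID."""
    abandoned = session.query(DBTask) \
        .filter(DBTask.machine_id == machine_id,
                DBTask.status.notin_(TERMINATION_STATES))
    now = datetime.now(timezone.utc)
    for task in abandoned:
        task.status = "DROPPED"
        task.comments = (task.comments or "") + \
            f"\nDROPPED at {now}: a previous server instance with this " \
            "machine ID terminated without being able to record it."
    session.commit()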

@whisperity added the enhancement 🌟, API change 📄, database 🗄️, RDY-OnHold 🛑, server 🖥️, refactoring 😡 ➡️ 🙂, and config ⚙️ labels on Aug 17, 2024
@whisperity added this to the release 6.25.0 milestone on Aug 17, 2024
@whisperity force-pushed the feat/server/asynchronous-store-protocol/patch/2-task-management branch 3 times, most recently from 3484290 to 60a262c, on August 19, 2024, 17:11
@whisperity force-pushed the feat/server/asynchronous-store-protocol/patch/2-task-management branch from 60a262c to 2361221 on September 17, 2024, 15:34
@whisperity removed the RDY-OnHold 🛑 label on Sep 17, 2024
@whisperity force-pushed the feat/server/asynchronous-store-protocol/patch/2-task-management branch 2 times, most recently from ed75dbc to a75244f, on September 18, 2024, 07:25
if db_task.username == self._get_username():
    has_right_to_query_status = True
    should_set_consumed_flag = db_task.is_in_terminated_state
elif db_task.product_id is not None:
Reviewer (Contributor):
Why is this elif? Why is it mutually exclusive if username or product is provided?

Author (whisperity):

Because different workflows are executed. You can only query the status of the task if:

  1. You are the user who was responsible for creating the task. (In this case, but only for terminated tasks, the consumed flag is set to true.)
  2. The task is associated with a product (this might not always be the case!) and you are an admin of said product. (In this case, the consumed flag is left intact.)
  3. You are a superuser. (In this case, the consumed flag is left intact.)

If this were not an elif, a product admin or superuser querying their own task would leave the consumed flag intact.

with DBSession(self._database_factory) as session:
    try:
        db_task = session.query(DBTask).get(task_obj.token)
        session.expunge(db_task)
Reviewer (Contributor):
What is the goal of this expunge()?

Author (whisperity):

Removes the DBTask object from the Session's internal cache (the identity map), leaving the instance in the detached state and severing the connection between the Python object and the ORM-handled "view" of it. This ensures that callers of get_task_record(), even if they capture a reference to what is returned here, are NOT allowed to mutate the state of the object in a way that also mutates the database state.

(In order to safely mutate a DBTask entity, _mutate_task_record exists.)
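A small, generic SQLAlchemy illustration of the point (not the project's actual code): once expunged, the object is detached from the Session, so later attribute changes are no longer tracked or flushed back to the database:

from sqlalchemy import inspect


def get_detached_record(session, DBTask, token):
    """Load a DBTask and detach it, so callers cannot mutate database state through it."""
    db_task = session.query(DBTask).get(token)
    session.expunge(db_task)            # drop it from the Session's identity map
    assert inspect(db_task).detached    # no longer tracked by the ORM
    return db_task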

to communicate large inputs (that should not be put in the `Queue`)
to the `execute` method of an `AbstractTask`.
"""
task_tmp_root = Path(tempfile.gettempdir()) / "codechecker_tasks" \
Reviewer (Contributor):
It's not a real question, I'm just thinking out loud: we put things under /tmp because this is server-side, right? Only the client uses the workspace directory instead of /tmp for temporary files, right?

Author (whisperity):

Yes, this is all server-side. It is just a temporary directory for the task's temporary data (such as the unzipped contents of a mass store run) that is more identifiable than a truly "random" temporary directory, because servers can fail in ways where the cleanup of temp directories is never executed, as we are no longer using the tempfile... context managers here.
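In other words, the per-task temporary location is rooted at an identifiable, fixed path, roughly as in this sketch (the exact directory layout and naming are assumptions):

import tempfile
from pathlib import Path


def task_data_dir(token: str) -> Path:
    """Create an identifiable temporary directory for a task's large input data."""
    path = Path(tempfile.gettempdir()) / "codechecker_tasks" / token
    path.mkdir(parents=True, exist_ok=True)
    return path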

LOG.warning("Failed to remove background task's data_dir at "
"'%s':\n%s", self.data_path, str(ex))

def _implementation(self, _task_manager: "TaskManager") -> None:
Reviewer (Contributor):
Shouldn't it get an @abstractmethod annotation?

Author (whisperity):

Unfortunately, @abstractmethod isn't a built-in but comes from the abc library, which makes the object not pickle-able.

Also, methods that raise NotImplementedError() are recognised by Pyright's type checking, and the editor shows a linter warning if your subclass is not proper.
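A tiny illustration of the approach taken instead of abc: the base method simply raises NotImplementedError, keeping the class free of abc machinery while type checkers and linters still flag subclasses that forget to override it:

import pickle


class AbstractTask:
    """Base class without abc machinery, per the rationale above."""

    def _implementation(self, _task_manager) -> None:
        raise NotImplementedError("Subclasses must override _implementation().")


class MyTask(AbstractTask):
    def _implementation(self, _task_manager) -> None:
        pass  # actual business logic would go here


pickle.dumps(MyTask())  # the task object can travel through the server's queue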

web/api/tasks.thrift: an outdated review comment was marked as resolved.
@whisperity force-pushed the feat/server/asynchronous-store-protocol/patch/2-task-management branch from a75244f to d2bb4d5 on September 26, 2024, 12:45
This patch implements the whole support ecosystem for server-side
background tasks, in order to help lessen the load (and blocking) of API
handlers in the web-server for long-running operations.

A **Task** is represented by two things in strict co-existence: a
lightweight, `pickle`-able implementation in the server's code (a
subclass of `AbstractTask`) and a corresponding `BackgroundTask`
database entity, which resides in the "configuration" database (shared
across all products).
A Task is created by API request handlers and then the user is
instructed to retain the `TaskToken`: the task's unique identifier.
Following this, the server dispatches execution of the object to a
background worker process and keeps the status synchronised via the
database.
Even in a service cluster deployment, load balancing will not interfere
with users' ability to query a task's status.

While normal users can only query the status of a single task (which is
usually done automatically by client code, and not by a user manually
executing something), product administrators, and especially server
administrators, have the ability to query an arbitrary set of tasks
using the available filters, through a dedicated API function
(`getTasks()`) for this purpose.

Tasks can be cancelled only by `SUPERUSER`s, at which point a special
binary flag is set in the status record.
However, to prevent complicating inter-process communication,
cancellation is supposed to be implemented by `AbstractTask` subclasses
in a co-operative way.
The execution of tasks in a process and a `Task`'s ability to
"communicate" with its execution environment is achieved through the new
`TaskManager` instance, which is created for every process of a server's
deployment.

Unfortunately, tasks can die gracelessly if the server is terminated
(either internally, or even externally).
For this reason, the `DROPPED` status will indicate that the server has
terminated prior to, or during a task's execution, and it was unable to
produce results.
The server was refactored significantly around the handling of subprocesses
in order to support various server shutdown scenarios.

Servers will start `background_worker_processes` number of task handling
subprocesses, which are distinct from the already existing "API
handling" subprocesses.
By default, if unconfigured, `background_worker_processes` is equal to
`worker_processes` (the number of API processes to spawn), which is
equal to `$(nproc)` (CPU count in the system).

This patch includes a `TestingDummyTask` demonstrative subclass of
`AbstractTask` which counts up to an input number of seconds, and each
second it gracefully checks whether it is being killed.
The corresponding testing API endpoint, `createDummyTask()` can specify
whether the task should simulate a failing status.
This endpoint can only be used from, and is used extensively in, the
project's unit tests.

This patch does not include "nice" or "ergonomic" facilities for admins
to manage the tasks, and so far, only the server side of the
corresponding API calls is supported.
@whisperity force-pushed the feat/server/asynchronous-store-protocol/patch/2-task-management branch from d2bb4d5 to 19302e4 on September 26, 2024, 13:05