[FEATURE-REQUEST] Expose RP "services" as a special (multi-node) task #2543

Closed
srini009 opened this issue Mar 9, 2022 · 4 comments

@srini009

srini009 commented Mar 9, 2022

Description:
RP currently exposes a "services" abstraction at the pilot level. The services field takes as input a "list of commands" to execute on a dedicated node from the pilot job allocation. At present, this feature only supports executing a sequence of "non-MPI" commands/programs.

Example use-case:
Consider the situation where we would like to run a performance monitoring service as part of the pilot job. This performance monitoring service would (at the very least) need to support a distributed database to hold the collected performance data. The database needs to be distributed so that it does not become the bottleneck in the overall execution of the RP pilot job. We envision several "clients" connecting to this service to store their pieces of performance data. These "clients" could be user-level RP tasks or other daemons that are spawned on the compute nodes to collect node-level performance data. The intention is to use the collected performance data to perform dynamic, adaptive scheduling of future tasks (based on historical observations). Thus, I would like to request that "services" be exposed as a "special RP task" with the following semantics:

  1. At its core, the distributed (monitoring) services are themselves treated as RP tasks, with the exception that these services are considered first-class citizens of the pilot.
  2. There can be more than one "service" node. It is left to the user's discretion how many nodes from the pilot job they want to allocate to RP pilot services. These nodes need to be removed from the set of nodes available for running "user-level" RP tasks for the duration of the pilot job.
  3. The user can set up custom "pre-exec", "input/output" staging, and "post-exec" commands for each of the service tasks that are spawned.
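
To make the requested semantics concrete, here is a minimal sketch of point 2: reserving some nodes of the pilot allocation for services and leaving the rest for user-level tasks. Everything in it is an assumption for illustration only. `ServiceSpec`, `partition_nodes`, the node names and the placeholder executables are hypothetical and are not part of the RP API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSpec:
    """Hypothetical description of one service task (not an RP class)."""
    name: str
    executable: str
    nodes: int = 1                                   # nodes reserved for this service
    pre_exec: list = field(default_factory=list)     # commands run before the service
    post_exec: list = field(default_factory=list)    # commands run after the service


def partition_nodes(pilot_nodes, services):
    """Reserve nodes for services; return (service placement, remaining nodes)."""
    needed = sum(s.nodes for s in services)
    if needed > len(pilot_nodes):
        raise ValueError(
            f"services need {needed} nodes, pilot only has {len(pilot_nodes)}")
    placement, cursor = {}, 0
    for s in services:
        placement[s.name] = pilot_nodes[cursor:cursor + s.nodes]
        cursor += s.nodes
    # Whatever is left is the only pool that user-level tasks may use.
    return placement, pilot_nodes[cursor:]


nodes = [f"node{i:02d}" for i in range(6)]
services = [
    ServiceSpec("perf_db", "/bin/true", nodes=2),    # placeholder executable
    ServiceSpec("collector", "/bin/true", nodes=1),  # placeholder executable
]
placement, task_nodes = partition_nodes(nodes, services)
```

With six nodes and three reserved for services, user-level tasks would only ever see the remaining three nodes for the lifetime of the pilot.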
@andre-merzky
Member

andre-merzky commented Mar 14, 2022

two types of services:
(1): 1 process per node (tau, system monitor)
(2): using a separate node (tau, redis)

Note that the first requires changes to the RP task description (see also #2293)

pilot_description.services should become a list of task descriptions
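
One way to picture this change is a helper that turns the current "list of commands" form into a list of task-description-like records. This is only a sketch: plain dicts stand in for task descriptions, the helper name `commands_to_descriptions` is hypothetical, and the keys `executable`, `arguments` and `ranks` are assumed field names chosen for illustration.

```python
import shlex


def commands_to_descriptions(commands):
    """Convert command strings into task-description-like dicts (illustrative)."""
    descriptions = []
    for cmd in commands:
        parts = shlex.split(cmd)          # respects shell-style quoting
        if not parts:
            continue                      # skip empty command strings
        descriptions.append({
            "executable": parts[0],
            "arguments": parts[1:],
            "ranks": 1,                   # current services are non-MPI
        })
    return descriptions


legacy_services = ["redis-server --port 6379"]
service_descriptions = commands_to_descriptions(legacy_services)
```

Once each service is a full description rather than a bare command string, attributes such as pre-exec commands or staging directives have an obvious place to live.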

@kartikmodi
Contributor

Scope of 1st Phase (by 26 Dec):

  1. Creation of services from task description
  2. Callback handling from the service task when its state changes
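
A rough sketch of what item 2 could look like, assuming a simple observer pattern. `ServiceTask`, its methods and the state names used here are hypothetical stand-ins for illustration, not the RP implementation.

```python
class ServiceTask:
    """Hypothetical stand-in for a service task; not an RP class."""

    def __init__(self, uid):
        self.uid = uid
        self.state = "NEW"
        self._callbacks = []

    def register_callback(self, cb):
        """Register cb(task, state), invoked on every state change."""
        self._callbacks.append(cb)

    def advance(self, new_state):
        """Move to new_state and notify callbacks only if the state changed."""
        if new_state == self.state:
            return
        self.state = new_state
        for cb in self._callbacks:
            cb(self, new_state)


events = []
svc = ServiceTask("service.0000")
svc.register_callback(lambda task, state: events.append((task.uid, state)))
svc.advance("RUNNING")
svc.advance("RUNNING")   # same state again, so no callback fires
svc.advance("DONE")
```

A pilot could use such a callback to hold back user-level tasks until every service has reported that it is running.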

@radical-cybertools radical-cybertools deleted a comment from kartikmodi Jan 13, 2023
@andre-merzky
Member

andre-merzky commented Jan 13, 2023

2nd Phase:

  • add scheduling capabilities to allow services to run on all nodes. This implies supporting a ranks_per_node attribute for the task description.
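
As an illustration of what a `ranks_per_node` attribute implies for placement, the sketch below assigns a fixed number of ranks to every node. The function `place_ranks` and the node names are assumptions for illustration; the actual scheduler logic is not described in this ticket.

```python
def place_ranks(nodes, ranks_per_node):
    """Return (rank, node) pairs with ranks_per_node ranks on each node."""
    if ranks_per_node < 1:
        raise ValueError("ranks_per_node must be at least 1")
    placement = []
    rank = 0
    for node in nodes:
        for _ in range(ranks_per_node):
            placement.append((rank, node))
            rank += 1
    return placement


# One monitoring process on every node of the pilot allocation.
per_node_service = place_ranks(["n0", "n1", "n2"], ranks_per_node=1)
```

With `ranks_per_node=1`, the total rank count follows from the number of nodes, which is the "1 process per node" service type mentioned earlier in the thread.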

@mtitov
Contributor

mtitov commented Apr 10, 2023

This ticket will be closed in favor of #2899

@mtitov mtitov closed this as completed Apr 10, 2023