Decouple dmod.scheduler from Docker #556

Open
christophertubbs opened this issue Mar 19, 2024 · 7 comments
Labels
enhancement New feature or request initiative A large, high-level task composition, with at least one Initiative or Epic subtask refactor Code Cleanup and Restructuring Urgent This needs to be addressed as soon as possible

Comments

@christophertubbs
Contributor

dmod.scheduler has a firm dependence on Docker, which makes sense considering that this is a product that operates through Docker. The hard coupling, however, makes it difficult, if not nigh impossible, to inject an alternative of some sort.

The parts that directly link to Docker should be abstracted out into a common base class with the docker implementation defined within another module. That module should then be referenced dynamically so that dmod.scheduler may be loaded into memory, even on systems without access to docker.

The scope of this issue does not extend to creating alternate implementations. That should be in a task that may be completed after this step.
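One way to read "referenced dynamically" is a lazy import: the scheduler only knows a base class and a dotted path, and the Docker module is imported on demand. A minimal sketch, assuming hypothetical names (`ContainerBackend`, `load_backend`, and the `module:Class` path format are illustrative and not part of dmod):

```python
import importlib
from abc import ABC, abstractmethod


class ContainerBackend(ABC):
    """Hypothetical base class for whatever the scheduler uses to start work."""

    @abstractmethod
    def start(self, job_id: str) -> str:
        """Start the job and return a backend-specific handle."""


def load_backend(dotted_path: str) -> ContainerBackend:
    """Import 'package.module:ClassName' only when it is actually requested."""
    module_name, _, class_name = dotted_path.partition(":")
    # The import happens here, not at scheduler import time, so a missing
    # docker package only fails if a Docker-backed class is requested.
    module = importlib.import_module(module_name)
    backend_cls = getattr(module, class_name)
    if not (isinstance(backend_cls, type) and issubclass(backend_cls, ContainerBackend)):
        raise TypeError(f"{dotted_path} is not a ContainerBackend")
    return backend_cls()
```

With this shape, importing the scheduler never touches the `docker` package; only a call to `load_backend` that names a Docker-backed class would.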

@aaraney
Member

aaraney commented Mar 20, 2024

> dmod.scheduler has a firm dependence on Docker, which makes sense considering that this is a product that operates through Docker.

dmod.monitor is in this camp too and, to an extent, so is dmod.dataservice.

> The parts that directly link to Docker should be abstracted out into a common base class with the docker implementation defined within another module.

To be fair, the concept of a Job is abstracted from Docker Swarm concepts (i.e. services and tasks). The dmod.scheduler is a service, and I think it is up for debate how generic / configurable a service should be. However, point taken.

The difficulty in this problem is not Docker or Docker Swarm, but exposing data to the computational environment. In our case, that means getting data from the dataservice mounted into one or more swarm services (for Jobs that require MPI, each MPI "node" is a separate swarm service). So, going from minio -> s3 docker volume -> running swarm service task.

In our current architecture, the dataservice is aware of the computational environment, in that it knows how to create a volume that is compliant and usable by the scheduler. Adding an abstraction to the scheduler over the current docker swarm implementation will not address this issue.

@christophertubbs
Contributor Author

christophertubbs commented Mar 20, 2024

The metric of success here is the ability to write another implementation of some of this logic so that docker may be avoided and an instance of a configured ngen can run sans container. Let's call it the 'dumb implementation'. Job may not be the correct piece of the puzzle (neither dmod.scheduler nor dmod.monitor is my area of expertise), but the thing that launches and tracks ngen execution needs to be able to support a non-docker environment and preferably a laptop/desktop environment (although this mode of play should never be the version deployed anywhere).

The concept behind the idea isn't really the point; it's the mechanics. The mechanics just need to be able to support the base/dumb/idiotic case all around. For instance, the dataservice shouldn't be a primary docker product either, just a product that can (and probably should) use docker. Liken it to Django - when running locally, it doesn't require gunicorn or nginx, but, when deploying it, it better use a reverse proxy/load balancer and attached webserver. Or imagine using the type hint of typing.MutableMapping - 90% of the time that's just another way of saying dict, but another implementation of a mutable mapping may come along that needs support, yet looks the exact same.
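The `typing.MutableMapping` analogy can be made concrete with a short sketch. `LoggingMapping` and `mark_done` are hypothetical names made up for illustration; the point is only that a function typed against the abstract interface accepts a plain dict and a non-dict implementation alike:

```python
from collections.abc import MutableMapping as MutableMappingABC
from typing import Dict, Iterator, List, MutableMapping


class LoggingMapping(MutableMappingABC):
    """A mutable mapping that is not a dict and records every write."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.writes: List[str] = []

    def __getitem__(self, key: str) -> str:
        return self._data[key]

    def __setitem__(self, key: str, value: str) -> None:
        self.writes.append(key)
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)


def mark_done(jobs: MutableMapping[str, str], job_id: str) -> None:
    """Works with any mutable mapping; it never asks whether it got a dict."""
    jobs[job_id] = "done"


plain: Dict[str, str] = {}
logged = LoggingMapping()
mark_done(plain, "job-1")   # the 90% case
mark_done(logged, "job-1")  # the other implementation, same call
```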

If implemented right, it'll be non-trivial but possible to switch out the used docker implementation for an implementation that can play with a K8 API.

So the overall logic here wouldn't change, since we still need to run this using docker swarm; we just need to create places where we can perform a switcheroo.

@hellkite500
Member

I'm pretty sure that was refactored long ago so the scheduler and the Launcher are independent components... Specifically with this idea in mind. Are there still specific artifacts lingering in the scheduler itself that are too platform specific? I can dig in a bit and remind myself, it's been a while...

@christophertubbs
Contributor Author

The JobManagerFactory requires a Launcher object that is tightly coupled to docker (parent class is even SimpleDockerUtil) and requires the passing of things like hard coded image names (magic strings are sort of another no-no as well).

Forming a means of breaking Launcher objects away from docker functions will probably achieve this. Imagine having stuff like a LocalLauncher, K8Launcher, CloudLauncher, MockLauncher, and DockerLauncher.
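A rough sketch of what that family could look like, assuming a hypothetical `Launcher` interface with a single `launch` method. None of these signatures come from dmod; the real Launcher and JobManagerFactory take different arguments:

```python
import subprocess
from abc import ABC, abstractmethod
from typing import Dict, List


class Launcher(ABC):
    """Hypothetical interface: only what a job manager needs, nothing Docker-specific."""

    @abstractmethod
    def launch(self, job_id: str, command: List[str]) -> None:
        """Start the given command on behalf of the job."""


class LocalLauncher(Launcher):
    """Runs the command as a plain child process on this machine."""

    def __init__(self) -> None:
        self._procs: Dict[str, subprocess.Popen] = {}

    def launch(self, job_id: str, command: List[str]) -> None:
        self._procs[job_id] = subprocess.Popen(command)

    def wait(self, job_id: str) -> int:
        """Block until the job's process exits and return its exit code."""
        return self._procs[job_id].wait()


class MockLauncher(Launcher):
    """Starts nothing; just records what it was asked to launch (for tests)."""

    def __init__(self) -> None:
        self.launched: Dict[str, List[str]] = {}

    def launch(self, job_id: str, command: List[str]) -> None:
        self.launched[job_id] = list(command)


def submit(launcher: Launcher, job_id: str, command: List[str]) -> None:
    """Stand-in for the job manager: it depends only on the Launcher interface."""
    launcher.launch(job_id, command)
```

A DockerLauncher or K8Launcher would be one more subclass living in its own module, and the image name would become a constructor argument of that subclass rather than something the factory has to pass around.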

@aaraney
Member

aaraney commented Mar 20, 2024

@christophertubbs, I get what you are saying: there is a need for a workload management abstraction. What I'm trying to convey is that the operations a workload management abstraction should handle are currently split between the scheduler (in the Launcher) and the monitor service (in the DockerSwarmMonitor), and that how to generically provide data to a workload management abstraction is unclear.

@hellkite500
Member

> Forming a means of breaking Launcher objects away from docker functions will probably achieve this. Imagine having stuff like a LocalLauncher, K8Launcher, CloudLauncher, MockLauncher, and DockerLauncher.

Not sure exactly when Launcher became a subclass (it started out as its own base class). But it wouldn't be too difficult to build a basic ABC and pull the docker-specific pieces into their own subclass.

This has been on the wish list for some time, along with deployment support for different backends (e.g. control_stack options for deploying dmod on different systems).

@aaraney
Member

aaraney commented Mar 22, 2024

To that point, @hellkite500, if we do go down this road, I think there is a strong argument to combine some of the functionality of the scheduler and the monitor services. If we were to go that route, I think we could generalize the concept of scheduling and monitoring, making it easier to support and maintain existing backends and to introduce different backends if needed in the future.
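As a sketch of what a generalized scheduling-plus-monitoring concept might look like, here is a single interface that covers both submitting a job and asking for its state. `WorkloadBackend`, `JobState`, `InMemoryBackend`, and `snapshot` are hypothetical names for illustration only, and the sketch deliberately leaves out the open question of how data gets exposed to the job:

```python
import enum
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class JobState(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


class WorkloadBackend(ABC):
    """Hypothetical single interface covering both launching and monitoring."""

    @abstractmethod
    def submit(self, job_id: str) -> None:
        """Hand the job to the backend (what the scheduler does today)."""

    @abstractmethod
    def state(self, job_id: str) -> JobState:
        """Report the job's current state (what the monitor does today)."""


class InMemoryBackend(WorkloadBackend):
    """Toy backend that tracks states in a dict; finish() simulates completion."""

    def __init__(self) -> None:
        self._states: Dict[str, JobState] = {}

    def submit(self, job_id: str) -> None:
        self._states[job_id] = JobState.RUNNING

    def state(self, job_id: str) -> JobState:
        # Jobs the backend has never seen are reported as pending.
        return self._states.get(job_id, JobState.PENDING)

    def finish(self, job_id: str) -> None:
        self._states[job_id] = JobState.FINISHED


def snapshot(backend: WorkloadBackend, job_ids: Iterable[str]) -> Dict[str, JobState]:
    """Backend-agnostic monitoring loop body: poll every job once."""
    return {job_id: backend.state(job_id) for job_id in job_ids}
```

A Docker Swarm backend would implement the same two methods against swarm services and tasks, so the generic monitoring code in `snapshot` would not need to change.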


3 participants