Add Fusion support to Condor executor #3697
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Adding for your convenience the required configuration to enable Fusion and Wave with Minio
Just wanted to let @bentsherman and @pditommaso know that, after some back and forth with our sysadmins, I finally have a nextflow-condor container with the minimal configuration needed for Nextflow to submit and monitor jobs on our Condor system, which means I am finally ready to start investigating and building off of this branch. Next steps:
Thank you, and I'll keep you posted!
0d59b4c to b93634e
Closing for lack of activity. Feel free to comment or reopen if needed.
This is a project that I unfortunately can only work on when I have time to spare, so apologies for the stop-start nature of the work.

I have figured out one point of confusion that is causing some of these issues: Condor isn't a grid executor. At least, not really. It's much more like a grid executor plus a Nextflow-style overlay that abstracts away much of the nuts and bolts of generating bash scripts that can run in multiple environments. One further challenge is that few features have been deprecated over 20+ years of development. Much of the documentation out there describes one method of working with Condor as preferred, even though it has been superseded in the past 5 years by a new method that partially breaks the old one.

So, for example, Condor cannot take an executable in via stdin. It can take a condor_submit file via stdin, but that submit file must refer to a saved executable file. It will not even run "echo hello world" as an executable - that command has to be wrapped in a saved bash script. (This is because Condor anticipates needing to make Nextflow-style changes to the executable to allow for machine-agnostic execution.)

Condor appears to be very powerful, and there are a few features I would recommend trying to steal from it: job prioritization, for example, as well as preemptable jobs, Windows/Linux flexibility, and multiple executables selected by machine specifications (e.g. different instruction sets, or using memory-light, IO-heavy code when running on a machine with an NVMe drive).
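The constraint described above (a command must live in a saved script, and the submit file must point at that script) can be sketched roughly as follows. This is only an illustration, not code from this PR: the helper names `write_wrapper` and `submit_description` and the file names `job.sh`, `job.out`, `job.err`, and `job.log` are hypothetical, and the submit description uses only the basic `universe`, `executable`, `output`, `error`, `log`, and `queue` commands.

```python
import stat
import tempfile
from pathlib import Path


def write_wrapper(command: str, workdir: Path) -> Path:
    """Save a one-line command as an executable bash script.

    Condor will not run a bare command string, so the command is wrapped
    in a file on disk. The file name job.sh is a hypothetical choice.
    """
    script = workdir / "job.sh"
    script.write_text(f"#!/bin/bash\n{command}\n")
    # Add the owner-executable bit so the script can be used as an executable.
    script.chmod(script.stat().st_mode | stat.S_IXUSR)
    return script


def submit_description(executable: Path) -> str:
    """Build a minimal vanilla-universe submit description (illustrative only)."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wrapper = write_wrapper("echo hello world", Path(tmp))
        # The resulting text is what would be handed to condor_submit.
        print(submit_description(wrapper))
```

The submit description is plain text, so it could be piped to `condor_submit` on stdin or saved to a file first; only the executable itself has to exist on disk.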
I am currently trying to find a workaround for the executable file issue, as it appears to be incompatible with Grid Fusion as written. Grid Fusion submits

While I previously suggested that docker container support would require use of the "docker universe" specification, I have since learned that specifying a "docker universe" just creates a wrapper script à la Nextflow. For example, here is a suggestion on how to use a Singularity container that simply runs a singularity command in a vanilla universe. Since this article was written, HTCondor has added a separate "container universe" to support Singularity.
It seems to me like there are a few different paths you could go down @bentsherman, depending on how you want to structure this feature:
I have ranked these in order of what I think best aligns with Condor's intended use. However, I'm not sure whether it fits your code design to create a CondorTaskHandler that extends the GridTaskHandler and modifies the Fusion submission code as needed. I might need to do something along those lines to get my nf-condor branch running.
I think your best option then is to write the submit script to a temp file and submit it to Condor as you described in (1). The nf-float plugin also does this for a similar executor, so you can follow their example.

As for Condor's Nextflow-like features, it is not unusual for batch schedulers to have additional features like this to make them more usable for native users. Nextflow generally ignores these extra features in favor of its own. If the docker universe just adds a wrapper script without, for example, affecting how infrastructure is provisioned, you should just re-use Nextflow's existing code to wrap the task script in a docker command (in fact Grid Fusion should do this for you).
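As a rough sketch of the temp-file approach suggested above: the snippet below writes a submit description to a temporary file and builds the `condor_submit <file>` command line. It is not taken from Nextflow or nf-float; the function names `stage_submit_file`, `condor_submit_command`, and `submit` are hypothetical, and the actual submission only runs if `condor_submit` is found on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List


def stage_submit_file(submit_text: str, workdir: Path) -> Path:
    """Write the submit description to a temp file inside workdir and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".submit", dir=workdir, delete=False
    ) as handle:
        handle.write(submit_text)
    return Path(handle.name)


def condor_submit_command(submit_file: Path) -> List[str]:
    """Build the command line that submits a saved submit description."""
    return ["condor_submit", str(submit_file)]


def submit(submit_text: str, workdir: Path) -> str:
    """Stage the file and run condor_submit; requires a local HTCondor install."""
    command = condor_submit_command(stage_submit_file(submit_text, workdir))
    if shutil.which("condor_submit") is None:
        raise RuntimeError("condor_submit not found on PATH")
    result = subprocess.run(
        command, cwd=workdir, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Keeping the file staging separate from the subprocess call makes the first two functions easy to unit test with mocks, without a real scheduler.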
I've decided to pursue option #2, translating Nextflow's container construction output into a Condor submit file in a 'container' universe. Unfortunately, testing on UWisc's network shows that the Condor account does not have permission on execute machines to launch a docker job unless it is launched through a Condor submit file.

The only hiccup I'm encountering with respect to passing docker run arguments through Condor is that I cannot run docker with the --privileged flag. For security reasons, Condor does not allow users to specify docker run flags directly. However, most flags can be translated into Condor submit file commands. I understand that Fusion doesn't strictly need that flag to function.
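The translation described above might look something like the sketch below, which maps a few docker-run-style settings (image, CPUs, memory, environment variables) onto submit file commands for the container universe and rejects `--privileged`. This is an assumption-laden illustration rather than the implementation in this PR: the function name `container_submit_description` and its parameters are hypothetical, and the exact set of supported submit commands should be checked against the manual for the local HTCondor version.

```python
from typing import Dict, Optional


def container_submit_description(
    image: str,
    executable: str,
    cpus: int = 1,
    memory_mb: int = 1024,
    env: Optional[Dict[str, str]] = None,
    privileged: bool = False,
) -> str:
    """Translate docker-run-style settings into a container-universe submit description."""
    if privileged:
        # Users cannot pass docker run flags directly, so there is nothing
        # to translate --privileged into.
        raise ValueError("--privileged cannot be expressed in a submit file")

    env = env or {}
    for key, value in env.items():
        # Keep the sketch simple: values needing quoting are rejected, not escaped.
        if any(ch.isspace() or ch in "\"'" for ch in value):
            raise ValueError(f"unsupported characters in value of {key}")

    lines = [
        "universe = container",
        f"container_image = {image}",
        f"executable = {executable}",
        f"request_cpus = {cpus}",      # rough analogue of docker run --cpus
        f"request_memory = {memory_mb}",  # rough analogue of --memory, in MiB
    ]
    if env:
        pairs = " ".join(f"{k}={v}" for k, v in sorted(env.items()))
        lines.append(f'environment = "{pairs}"')  # analogue of -e KEY=VALUE
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        container_submit_description(
            image="docker://ubuntu:22.04",  # example image, not from the PR
            executable="job.sh",
            cpus=2,
            memory_mb=2048,
            env={"NXF_DEBUG": "1"},
        )
    )
```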
With regard to testing: is there a method of running CI tests in a container that contains Condor? How do you integrate your test environments with GitHub?
We don't run any CI tests with real HPC schedulers, just unit tests with mocks. You will need to build Nextflow locally and test it with your Condor installation.
@JosephLalli this PR is ready for you to test. Here is the quickstart to build and test locally:
Just keep in mind that we haven't tested Fusion on MinIO-based S3 compatible storage yet, so that part might not work.