
Add Fusion support to Condor executor #3697

Closed
wants to merge 1 commit

Conversation

bentsherman
Member

@JosephLalli this PR is ready for you to test. Here is the quickstart to build and test locally:

# build
git clone -b fusion-condor git@github.com:nextflow-io/nextflow.git
cd nextflow
make compile

# test
../launch.sh run main.nf ...

Just keep in mind that we haven't tested Fusion on MinIO-based S3-compatible storage yet, so that part might not work.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@pditommaso
Member

Adding, for your convenience, the required configuration to enable Fusion and Wave with MinIO:

aws {
  client {
    endpoint = '<MINIO ENDPOINT>'
    s3PathStyleAccess = true
  }
  accessKey = '<ACCESS>'
  secretKey = '<SECRET>'
}

wave {
  enabled = true
}

fusion {
  enabled = true
  exportAwsAccessKeys = true
}

@JosephLalli

Just wanted to let @bentsherman and @pditommaso know that, after some back and forth with our sysadmins, I finally have a nextflow-condor container with the minimal configuration needed for Nextflow to submit and monitor jobs on our Condor system.

That means I am finally ready to start investigating and building off of this branch.

Next steps:

  1. Work on Nextflow's support for running containerized jobs with Condor.
  2. Test the Fusion integration.

Thank you, and I'll keep you posted!

@pditommaso force-pushed the master branch 2 times, most recently from 0d59b4c to b93634e on March 11, 2023
@pditommaso marked this pull request as draft on April 9, 2023
@pditommaso
Member

Closing for lack of activity. Feel free to comment or reopen if needed.

@pditommaso closed this Jul 5, 2023
JosephLalli added a commit to JosephLalli/nextflow that referenced this pull request Jan 9, 2024
@JosephLalli

JosephLalli commented Jan 11, 2024

This is a project that I unfortunately can only work on when I have time to spare, so apologies for the stop-start nature of the work.

I have figured out one point of confusion that is causing some of these issues: Condor isn't a grid executor. At least, not really. It's much more like a grid executor plus a Nextflow-style overlay that abstracts away much of the nuts and bolts of generating bash scripts that can run in multiple environments. A further challenge is that few features have been deprecated over 20+ years of development, so much of the documentation out there describes a method of working with Condor as preferred when it has in fact been superseded in the past 5 years by a new method that partially breaks the old one.

For example, Condor cannot take an executable via stdin. It can take a condor_submit file via stdin, but that submit file must refer to a saved executable file. It will not even run "echo hello world" as an executable; that command has to be wrapped in a saved bash script. (This is because Condor anticipates needing to make Nextflow-style changes to the executable to allow for machine-agnostic execution.) Condor appears to be very powerful, and there are a few features I would recommend trying to steal from it: job prioritization, for example, as well as preemptible jobs, Windows/Linux flexibility, and multiple executables chosen by machine specification (e.g. different instruction sets, or memory-light, I/O-heavy code when running on a machine with an NVMe drive).
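
A minimal illustration of that constraint, with hypothetical file names (this assumes, per the above, that condor_submit accepts the submit description itself on stdin):

# The payload must exist on disk; Condor will not accept it inline:
cat > hello.sh <<'EOF'
#!/bin/bash
echo hello world
EOF
chmod +x hello.sh

# The submit description may arrive via stdin, but its 'executable'
# line must point at the saved script:
condor_submit <<'EOF'
executable = hello.sh
output     = hello.out
error      = hello.err
log        = hello.log
queue
EOF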

@JosephLalli

I am currently trying to find a workaround for the executable-file issue, as it appears to be incompatible with Grid Fusion as written. Grid Fusion submits '#!/bin/bash\n' + submitDirective(task) + cmd + '\n' via stdin to the submit command, when ideally it would submit submitDirective(task) directly, with a wrapper file specified in the directives.

While I previously suggested that docker container support would require use of the "docker universe" specification, I have since learned that specifying a "docker universe" just creates a wrapper script, a la Nextflow. For example, here is a suggestion on how to use a Singularity container that simply runs a singularity command in a vanilla universe. Since that article was written, HTCondor has added a separate "container universe" to support Singularity.
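
For reference, a container-universe submit description might look roughly like this (the image and file names here are illustrative placeholders, not code from this PR):

# Hypothetical container-universe job; HTCondor pulls and runs the
# image itself, so no explicit docker/singularity command is needed:
universe        = container
container_image = docker://ubuntu:22.04
executable      = task.sh
output          = task.out
error           = task.err
log             = task.log
queue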

@JosephLalli

It seems to me like there are a few different paths you could go down, @bentsherman, depending on how you want to structure this feature:

  1. Generate a bash script for each job specifying how the remote machine should load the Fusion container, and submit the condor_submit directives via stdin in the vanilla universe, referencing that script as the executable (see the sketch after this list). This is what I would recommend, as it allows for the most flexibility when working with Condor.
    1a) This can be a bash script stored in the cloud if needed, though you would then need to rely on Condor to create the S3 connection.

  2. You could generate a docker universe submission file that provides all of the environment values and docker configuration values in Condor submission-file syntax, including a default execution file of /fusion/s3/workdir/.command.run.

  3. You could use condor_qsub to submit files. condor_qsub (documentation here) is a method of submitting SGE- or PBS/Torque-style jobs to a Condor scheduler. In theory, you can just generate files using the preexisting PBS Fusion code and submit to 'condor_qsub' instead of 'qsub'. However, condor_qsub only implements a subset of PBS/SGE/Torque features, and I cannot find a good listing of what that subset is. A few third-party "how to use Condor" guides suggest that "your mileage may vary if you try 'exotic' things. For Condor's full functionality and feature set better migrate to the native Condor tools ASAP." Finally, the code implementing condor_qsub has not been updated since 2018.
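
To make option 1 concrete, here is a rough sketch with assumed names (TASK_DIR, the wrapper filename, and the image are placeholders, not code from this branch):

# Condor requires the executable to exist on disk, so persist the
# wrapper that launches the Fusion-enabled container:
cat > "$TASK_DIR/fusion-wrapper.sh" <<'EOF'
#!/bin/bash
docker run --rm <FUSION-ENABLED IMAGE> /fusion/s3/workdir/.command.run
EOF
chmod +x "$TASK_DIR/fusion-wrapper.sh"

# Only the submit directives go in via stdin, referencing the saved
# wrapper as the executable:
condor_submit <<EOF
universe   = vanilla
executable = $TASK_DIR/fusion-wrapper.sh
output     = $TASK_DIR/condor.out
error      = $TASK_DIR/condor.err
log        = $TASK_DIR/condor.log
queue
EOF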

I have ranked these in order of what I think best aligns with Condor's intended use. However, I'm not sure whether it fits your code design to generate a CondorTaskHandler that extends GridTaskHandler and modifies the Fusion submission code as needed. I might need to do something along those lines to get my nf-condor branch running.

@bentsherman
Member Author

I think your best option then is to write the submit script to a temp file and submit it to condor as you described in (1). The nf-float plugin also does this for a similar executor, so you can follow their example.

As for Condor's Nextflow-like features, it is not unusual for batch schedulers to have extra features like these to make them more usable for native users; Nextflow generally ignores them in favor of its own. If the docker universe just adds a wrapper script without, e.g., affecting how infrastructure is provisioned, you should just re-use Nextflow's existing code to wrap the task script in a docker command (in fact, Grid Fusion should do this for you).

@JosephLalli

JosephLalli commented Jan 12, 2024

I've decided to pursue option 2: translating Nextflow's container-construction output into a Condor submit file in a 'container' universe.

Unfortunately, testing on UWisc's network shows that the Condor account does not have permissions on execute machines to launch a docker job unless it launches through a condor submit file.

The only hiccup I'm encountering with passing docker run arguments through Condor is that I cannot run docker with the --privileged flag. For security reasons, Condor does not allow users to specify docker run flags directly; however, most flags can be translated into Condor submit-file commands. I understand that Fusion doesn't strictly need that flag to function.
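
As an illustration of that translation (all values are placeholders, not this branch's actual output): docker run's -e flags become an environment line in the submit description, while --privileged simply has no submit-file equivalent:

# Hypothetical submit description standing in for
# 'docker run -e AWS_ACCESS_KEY_ID=... -e AWS_SECRET_ACCESS_KEY=... --privileged ...':
universe        = container
container_image = docker://<FUSION-ENABLED IMAGE>
executable      = .command.run
environment     = "AWS_ACCESS_KEY_ID=<ACCESS> AWS_SECRET_ACCESS_KEY=<SECRET>"
queue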

@JosephLalli

With regard to testing: is there a method of running CI tests in a container that contains Condor? How do you integrate your test environments with GitHub?

@bentsherman
Member Author

We don't run any CI tests with real HPC schedulers, just unit tests with mocks.

You will need to build Nextflow locally and test it against your Condor installation.
