
Cluster Execution #128

Open
cfhammill opened this issue Nov 26, 2018 · 9 comments
@cfhammill

I'm interested in running funflow pipelines that ship external jobs to a cluster scheduler, e.g. Torque/Slurm.

I was hoping to get some ideas on how to do this. I'm happy to write code and contribute it back to funflow if it doesn't exist yet.

  • For distribution, should this be done with funflow-jobs, via an additional coordinator?
  • For resource management, should this be done with a State effect? I can imagine resources being combined as the flow is composed: sequential composition of two resource-holding flows would take the max of the two memory/core/node requirements and the sum of the walltimes, while mapA would multiply the memory/core/node requirements.
  • Is there a uniform cluster resource spec that could be used to control dispatching to multiple clusters? Something like what exists in CWL/Xenon, maybe?
  • Would it be worth writing integrations for dhall configs of external tools, a bit like a typed/programmable CWL spec?
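To make the second point concrete, here is a hedged sketch of how resource requirements might combine under composition. The Resource type and the seqR/mapR combinators are hypothetical names invented for illustration, not funflow API:

```haskell
-- Hypothetical resource-requirement type; not part of funflow.
data Resource = Resource
  { memGB    :: Int   -- memory per task
  , nCores   :: Int   -- cores per task
  , nNodes   :: Int   -- nodes per task
  , wallMins :: Int   -- walltime in minutes
  } deriving (Show, Eq)

-- Sequential composition: max of memory/cores/nodes, sum of walltimes.
seqR :: Resource -> Resource -> Resource
seqR a b = Resource
  { memGB    = max (memGB a) (memGB b)
  , nCores   = max (nCores a) (nCores b)
  , nNodes   = max (nNodes a) (nNodes b)
  , wallMins = wallMins a + wallMins b
  }

-- mapA over n items: multiply the memory/core/node requirements.
mapR :: Int -> Resource -> Resource
mapR n r = r { memGB  = n * memGB r
             , nCores = n * nCores r
             , nNodes = n * nNodes r
             }
```

Whether walltime under mapA should also multiply (serial workers) or stay fixed (fully parallel) is exactly the kind of policy question the composition rules would have to pin down.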

Maybe some of these can be fragmented out as separate issues, let me know what I can do.

@nc6
Member

nc6 commented Nov 26, 2018

Hi! Great that you're looking into doing this.

In terms of the first point, I would tend to add an executor specific to the scheduler. We considered doing this for LSF, so I'd imagine this would be similar. An existing coordinator could be used, but one could simply submit the tasks using srun or similar.
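As a rough sketch of what that submission path might look like, assuming Slurm's srun and the process package: the names submitSrun and srunArgs are hypothetical, and the flag set is illustrative, not a complete Slurm invocation.

```haskell
import System.Process (callProcess)

-- Build the srun argument list from a (cores, memory) requirement.
-- Flags shown are real Slurm flags, but the selection is illustrative.
srunArgs :: Int -> Int -> FilePath -> [String] -> [String]
srunArgs cores memMB exe args =
  [ "--ntasks=1"
  , "--cpus-per-task=" ++ show cores
  , "--mem=" ++ show memMB
  ] ++ exe : args

-- Hypothetical submission helper: run an external task under srun
-- and block until it completes.
submitSrun :: Int -> Int -> FilePath -> [String] -> IO ()
submitSrun cores memMB exe args =
  callProcess "srun" (srunArgs cores memMB exe args)
```

A scheduler-specific executor would presumably wrap something like this, deriving the argument list from the task's resource requirements.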

Resource management: yes, ultimately I'd like to have something integrated into funflow for this, but until we do, using a state model seems appropriate. #125 might be useful for you here.

CWL has a resource requirement specification (https://www.commonwl.org/v1.0/CommandLineTool.html#ResourceRequirement), but it's not exactly comprehensive. There might be a better spec out there, though I haven't done much searching.
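For reference, a minimal CommandLineTool using the v1.0 ResourceRequirement looks roughly like this (the tool and values are illustrative):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs: []
outputs: []
requirements:
  - class: ResourceRequirement
    coresMin: 4
    ramMin: 8192      # MiB
    tmpdirMin: 1024   # MiB
```

Notably, the spec covers cores, RAM, and scratch/output space, but has no notion of walltime or node counts, which is part of why it isn't comprehensive for batch schedulers.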

I'm a little sceptical about dhall configs - every time I've used dhall for something, I've found it a difficult fit. In particular, where the underlying language is less typed, one has to reflect the validation at the dhall level as well, plus keep adapting the dhall spec to the underlying language. That having been said, in specific cases, maybe this would be a reasonable thing to do!

@cfhammill

Thanks!

That first part makes sense. Having an executor per cluster seems like a good approach, especially if they can share a common set of resources as in point 2. Do you have any code for the LSF case I could look at? I think LSF is quite similar to Slurm, which is one of the two schedulers I'm most interested in getting working.

I had a quick look at that issue; I'm not sure I immediately understand the consequences, but I'll puzzle over it a bit. I think it's suggesting that I can arbitrarily stack applicatives on top of my arrows without munging the internal effect. If so, that is exactly what I want.

I'll check out the CWL resource spec and see if it has most of the stuff I think I need.

And for the dhall configs, I'm not imagining rendering the dhall back to JSON for use with CWL. I'm imagining a dhall alternative to the command-line tool spec used in CWL. These dhall configs could be used to maintain defaults and generate CLIs for running whole pipelines. I now see this is a totally separate issue; in any case, I think the structure of the resource spec and the command-line tool spec should be similar.
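A rough illustration of what such a Dhall tool spec might look like; the record fields here are invented for illustration and mirror no existing schema:

```dhall
-- Hypothetical Dhall analogue of a CWL CommandLineTool spec.
let Resource = { memGB : Natural, cores : Natural, wallMins : Natural }

in  { baseCommand = "bwa"
    , arguments = [ "mem", "-t", "4" ]
    , resources = { memGB = 8, cores = 4, wallMins = 120 }
    }
```

The appeal is that defaults, overrides, and shared resource records become ordinary Dhall expressions with types checked up front, rather than untyped YAML.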

Thanks so much for your help and for this amazing library!

@kaizhang

kaizhang commented Dec 5, 2018

@cfhammill I'm rewriting my scientific workflow manager based on funflow. Currently I've implemented a DRMAA coordinator (compatible with Slurm and SGE) that can submit native Haskell code to remote compute nodes. If you are interested, you can take a look at this dev branch: https://github.com/kaizhang/SciFlow/tree/v1.0.

Here is an example application:

https://github.com/kaizhang/SciFlow/blob/v1.0/tests/socket.hs

@cfhammill

cfhammill commented Dec 5, 2018

This is really cool, @kaizhang. I hadn't thought to use DRMAA. That might solve a big chunk of what we're hoping to accomplish. Right now we're mostly concerned with wrapping external programs and running them on the cluster, but I can definitely imagine wanting to run native code in the future.

@cfhammill

@kaizhang, do you think your coordinator code could be PRed into funflow? It seems like it would be good to have alongside the rest of the coordinators.

@kaizhang

kaizhang commented Dec 8, 2018

In funflow, the coordinator is used to distribute external tasks, but my coordinator can distribute arbitrary Haskell code. Because the purposes are different, the interfaces of the two coordinators are quite different, so this cannot be merged into funflow without significant changes to the coordinator interface.

But I would like to share my design and let tweag folks decide whether this should be merged into funflow:

Because flows are free arrows, I can easily write multiple interpreters for the same workflow. I wrote two interpreters: one for the frontend (coordinator) and the other for the compute nodes (executor). On the frontend, the coordinator spawns new workers and assigns steps to them (using DRMAA). On the compute nodes, the executor queries job assignments from the coordinator through a socket.

mainWith :: (MonadMask m, MonadIO m, MonadBaseControl IO m)
         => RunMode
         -> FilePath
         -> DrmaaConfig
         -> SciFlow m a b
         -> a
         -> m b
mainWith runMode p config wf input = do
    dir <- liftIO $ makeAbsolute p >>= parseAbsDir
    CS.withStore dir $ \store -> case runMode of
        Master -> withDrmaa config $ \d ->
            runCoordinator d store wf input             -- On master node
        Slave -> withConnection config $ \conn ->
            runSciFlow conn store wf input              -- On compute nodes

I think such design patterns can be used for other types of distributed computing as well (not limited to DRMAA). And it solves the problem of distributing native Haskell codes.
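The free-arrow/two-interpreter pattern described above can be sketched in miniature. This toy Flow type and its two interpreters are stand-ins invented for illustration, not SciFlow's actual types:

```haskell
{-# LANGUAGE GADTs #-}

-- A named effectful step.
data Step a b where
  Step :: String -> (a -> IO b) -> Step a b

-- A minimal free-arrow-style workflow: identity, composition, lifted steps.
data Flow a b where
  Id    :: Flow a a
  (:>>) :: Flow a b -> Flow b c -> Flow a c
  Lift  :: Step a b -> Flow a b

-- Interpreter 1 (executor role): actually run the steps locally.
runLocal :: Flow a b -> a -> IO b
runLocal Id                x = pure x
runLocal (f :>> g)         x = runLocal f x >>= runLocal g
runLocal (Lift (Step _ k)) x = k x

-- Interpreter 2 (coordinator role, stand-in): inspect the workflow
-- without running it, e.g. to assign named steps to workers.
stepNames :: Flow a b -> [String]
stepNames Id                = []
stepNames (f :>> g)         = stepNames f ++ stepNames g
stepNames (Lift (Step n _)) = [n]
```

The point is that the same workflow value feeds both interpreters: the coordinator only needs the static structure (stepNames), while the executor needs the effects (runLocal).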

@kaizhang

I'm wondering: can we combine funflow and Cloud Haskell to distribute the workflow?

@cfhammill

I was somewhat thinking the same thing. That could unify the external/jobs interfaces.

@kaizhang

@cfhammill In case you are still interested, I've developed a prototype that uses Cloud Haskell to distribute workflows. Here is some example code:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes #-}

import Control.Monad.Reader
import Control.Concurrent (threadDelay)
import System.Environment

import Control.Workflow
import Control.Workflow.Coordinator.Local

s0 :: () -> ReaderT Int IO [Int]
s0 = return . const [1..10]

s1 :: Int -> ReaderT Int IO Int
s1 i = liftIO $ do
    threadDelay 5000000
    return i

s2 = return . (!!1)
s3 = return . (!!2)
s4 = return . (!!3)
s5 = return . (!!4)
s6 = return . (!!5)
s7 = return . (!!6)
s8 (a,b,c,d,e,f) = liftIO $ print [a,b,c,d,e,f]
s9 = liftIO . print 
    
build "wf" [t| SciFlow Int |] $ do
    node "S0" 's0
    nodePar "S1" 's1
    ["S0"] ~> "S1"

    node "S2" 's2
    node "S3" 's3
    node "S4" 's4
    node "S5" 's5
    node "S6" 's6
    node "S7" 's7
    ["S0"] ~> "S2"
    ["S0"] ~> "S3"
    ["S0"] ~> "S4"
    ["S0"] ~> "S5"
    ["S0"] ~> "S6"
    ["S0"] ~> "S7"

    node "S8" 's8
    ["S2", "S3", "S4", "S5","S6","S7"] ~> "S8"

    node "S9" 's9
    ["S1"] ~> "S9"

main :: IO ()
main = do
    [n] <- getArgs
    mainWith defaultMainOpts{_n_workers = read n} 100 wf

To use the DRMAA backend, just replace "Control.Workflow.Coordinator.Local" with "Control.Workflow.Coordinator.DRMAA". This is currently just a proof of concept.
