Scheduling based on labels #695

@mitar

Maybe scheduling could also work on custom labels. For example, I could start a worker and declare that it runs on a machine with the label datasets_available, and then submit a job that must be scheduled only on a machine/worker carrying the datasets_available label.
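To make the request concrete, here is a minimal sketch of the matching rule I have in mind. It is plain Python and not Ray's API: the Worker class and the eligible_workers function are hypothetical names used only for illustration. A job lists the labels it requires, and only workers carrying all of them are candidates.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Worker:
    # Hypothetical worker record: a name plus the custom labels it was started with.
    name: str
    labels: Set[str] = field(default_factory=set)


def eligible_workers(workers: Iterable[Worker], required_labels: Iterable[str]) -> List[Worker]:
    """Return the workers that carry every label the job requires."""
    required = set(required_labels)
    return [w for w in workers if required <= w.labels]


workers = [
    Worker("worker-a", {"datasets_available"}),
    Worker("worker-b"),
]

# A job that needs the datasets can only be placed on worker-a.
print([w.name for w in eligible_workers(workers, {"datasets_available"})])

# A job with no label requirement can be placed on any worker.
print([w.name for w in eligible_workers(workers, set())])
```

A real scheduler would still have to pick one worker among the eligible ones (for example by load); the sketch only covers the label filter.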

Initially we thought of mounting a distributed file system on every worker's machine and having workers access datasets as needed. But then the scheduler would need to know whether the distributed file system has already moved a dataset to a given worker (to schedule jobs close to the data, if possible).

I think it could be simpler if only one machine has the datasets: jobs on that machine read them as needed and store them in the object store, and Ray then transports those objects around as needed. We are working with datasets that fit into memory anyway, so this seems like the best approach. If we get to larger datasets, serializing them from the object store to a cache on the local drive still seems much better than parsing the whole dataset from scratch. (Although it depends on compression; uncompressed images can be much larger than compressed ones.)
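The local-cache idea for larger datasets could look roughly like the sketch below. It is only an illustration under assumptions: load_cached and produce are hypothetical names, produce stands in for whatever fetches the dataset (from the object store, or by parsing it), and pickle is just one possible serialization format.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_cached(cache_path, produce: Callable[[], Any]) -> Any:
    """Return the dataset from cache_path, calling produce() only on a cache miss."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: deserialize from the local drive instead of re-fetching.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: obtain the dataset once, then store it for later jobs.
    dataset = produce()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(dataset, f)
    return dataset


calls = []


def produce_example():
    # Stand-in for an expensive fetch or parse; records each invocation.
    calls.append(1)
    return {"images": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "dataset.pkl"
    first = load_cached(path, produce_example)
    second = load_cached(path, produce_example)
    # The second call is served from the local cache file.
    print(first == second, len(calls))
```

Whether this beats re-parsing depends on the point about compression above: a pickled, uncompressed dataset can take much more disk space than the original compressed files.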
