Description
Maybe scheduling could also work on custom labels. For example, I could start a worker and declare that it runs on a machine with the `datasets_available` label, and then have a job that requires being scheduled only on a machine/worker with that label.
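To make the request concrete, here is a minimal sketch of the matching rule I have in mind. It is plain Python, not Ray's API: `Worker`, `Job`, and `eligible_workers` are hypothetical names made up for illustration, and the only assumption is that a job may run on a worker when the worker carries every label the job requires.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical worker description: a name plus the labels it was started with.
    name: str
    labels: set = field(default_factory=set)


@dataclass
class Job:
    # Hypothetical job description: a name plus the labels it requires.
    name: str
    required_labels: set = field(default_factory=set)


def eligible_workers(job, workers):
    """Return the workers whose labels include every label the job requires."""
    return [w for w in workers if job.required_labels <= w.labels]


workers = [
    Worker("worker-a", {"datasets_available"}),
    Worker("worker-b"),  # no labels: cannot take jobs that need the datasets
]
load_job = Job("load_dataset", {"datasets_available"})
train_job = Job("train")  # no requirements: may run anywhere

print([w.name for w in eligible_workers(load_job, workers)])
print([w.name for w in eligible_workers(train_job, workers)])
```

A job without required labels stays schedulable everywhere, so only the dataset-loading jobs would be pinned to the labelled machine.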
Initially we thought of mounting a distributed file system on every worker's machine and having workers access datasets as needed. But then the scheduler would need to know whether the distributed file system has already moved a dataset to a given worker (to schedule jobs close to the data, if possible). I think it could be simpler if only one machine has the datasets: jobs there read them as needed and store them in the object store, and Ray then transports those objects around as needed. We are working with datasets that fit into memory anyway, so this seems like the best approach. And if we get to larger datasets, serializing them from the object store to a cache on the local drive seems much better than parsing the whole dataset from scratch. (Although it depends on compression; uncompressed images can be much larger than compressed ones.)
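The local cache idea could look roughly like the sketch below. It is only an illustration under assumptions: `load_cached` is a hypothetical helper, `pickle` stands in for whatever serialization would actually be used, and `fetch_fn` stands in for the expensive step (fetching from the object store, or parsing from scratch).

```python
import pickle
import tempfile
from pathlib import Path


def load_cached(name, fetch_fn, cache_dir):
    """Return dataset `name`, calling `fetch_fn` only if no local cache file exists."""
    cache_path = Path(cache_dir) / f"{name}.pkl"
    if cache_path.exists():
        # Cache hit: deserialize from the local drive, skipping the expensive step.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: do the expensive step once, then write the result to disk.
    dataset = fetch_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)
    return dataset


fetch_calls = []


def fetch_toy_dataset():
    # Hypothetical stand-in for the expensive fetch or parse.
    fetch_calls.append(1)
    return {"images": [[0, 1], [2, 3]], "labels": [0, 1]}


cache_dir = tempfile.mkdtemp()
first = load_cached("toy", fetch_toy_dataset, cache_dir)
second = load_cached("toy", fetch_toy_dataset, cache_dir)  # served from the cache
print(first == second, len(fetch_calls))
```

Whether this pays off depends on the compression caveat above: a cache of decoded images can be much larger on disk than the compressed originals.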