Description
User code may use multithreading automatically, but Dagger currently assumes that a thunk uses only one Dagger processor at a time (i.e. one CPU thread or one GPU SM/CU). As a result, Dagger can oversubscribe a worker with many multithreaded thunks, which can hurt cache locality quite severely.
We should let thunks tell the scheduler when they use a fixed number X of similar processors, and also when they automatically scale to the number of similar processors on a server. This could be implemented by a new `ThunkOptions` field which takes a callable that, when passed the processor instance or type, returns a `Dict` mapping each processor type to the maximum number of similar processors the thunk will utilize. This information should be propagated to the destination worker; if executing the thunk would oversubscribe the worker, the worker should wait to schedule it until other thunks finish and free up enough resources. If we cache worker processor hierarchies (something we don't currently do), we can also use this information to inform initial worker selection.
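
To make the proposal concrete, here is a minimal sketch of the worker-side bookkeeping. It is standalone and does not use Dagger. Every name in it (`ThreadProc`, `fixed_util`, `WorkerCapacity`, `try_reserve!`, `release!`) is hypothetical and chosen only for illustration; none of them is existing Dagger API, and the real field name and callable signature are still open.

```julia
# Hypothetical stand-ins; none of these names are part of Dagger's API.
abstract type Processor end
struct ThreadProc <: Processor end   # models one CPU thread

# Hypothetical option value: a callable that takes a processor type and
# returns a Dict from processor type to the maximum number of such
# processors the thunk will use.
fixed_util(n::Int) = proc_type -> Dict{Type,Int}(proc_type => n)

# Per-worker bookkeeping: total and currently reserved processors per type.
struct WorkerCapacity
    total::Dict{Type,Int}
    used::Dict{Type,Int}
end
WorkerCapacity(total::Dict{Type,Int}) = WorkerCapacity(total, Dict{Type,Int}())

# Reserve the requested processors if they all fit. Returns false, leaving
# the bookkeeping unchanged, if running the thunk would oversubscribe the
# worker; the caller would then defer the thunk.
function try_reserve!(cap::WorkerCapacity, request::Dict{Type,Int})
    for (T, n) in request
        get(cap.used, T, 0) + n > get(cap.total, T, 0) && return false
    end
    for (T, n) in request
        cap.used[T] = get(cap.used, T, 0) + n
    end
    return true
end

# Give back processors that were reserved with the same request.
function release!(cap::WorkerCapacity, request::Dict{Type,Int})
    for (T, n) in request
        cap.used[T] -= n
    end
    return nothing
end

# A worker with 4 CPU threads, one thunk using up to 3 threads, one using 1.
cap      = WorkerCapacity(Dict{Type,Int}(ThreadProc => 4))
threaded = fixed_util(3)(ThreadProc)
serial   = fixed_util(1)(ThreadProc)

r1 = try_reserve!(cap, threaded)  # fits: 3 of 4 threads reserved
r2 = try_reserve!(cap, threaded)  # would need 6 of 4 threads: defer
r3 = try_reserve!(cap, serial)    # fits: 4 of 4 threads reserved
release!(cap, threaded)           # the first thunk finishes
r4 = try_reserve!(cap, threaded)  # fits again: 1 + 3 = 4
```

A thunk that scales to all similar processors could return the worker's total for that processor type instead of a fixed count. The same per-worker totals, if cached on the scheduler side, are what initial worker selection would consult.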