**Is your feature request related to a problem? Please describe.**
The current MNMG RF is essentially a model-parallel approach: we distribute the data among the workers and also distribute the work of building separate trees across them. Each worker then builds its trees using only the data that is available to it.
While this is an embarrassingly parallel way to build the trees of an RF, it has some limitations:
- it does not work well if the dataset is wide (i.e., has many features).
- a tree built on a particular worker never sees the samples held by other workers, which can introduce bias.
**Describe the solution you'd like**
In addition to the current approach, we should provide an option for users to choose a data-parallel approach, which works as follows:
1. If the rows of the dataset are distributed across the workers, perform a sum-allReduce of the intermediate histograms among those workers before computing the best split.
2. If the columns of the dataset are distributed across the workers, perform a max-allReduce of the individual best splits among those workers to get the "global" best split.
3. If both rows and columns are distributed (aka 2D-partitioning of the dataset), do both 1 and 2.
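To make the proposal concrete, here is a minimal NumPy sketch of the 2D-partitioned case, simulating the collectives locally on lists of per-worker values. The helper names (`sum_allreduce`, `max_allreduce`), the worker layout, and the toy "gain" function are all illustrative, not part of any existing cuML API.

```python
import numpy as np

def sum_allreduce(arrays):
    """Sum-allReduce: every participant ends up with the elementwise sum."""
    total = np.sum(arrays, axis=0)
    return [total.copy() for _ in arrays]

def max_allreduce(candidates):
    """Max-allReduce over (gain, split) pairs: everyone gets the best one."""
    best = max(candidates, key=lambda c: c[0])
    return [best for _ in candidates]

# Toy 2D partitioning: 2 row-shards x 2 column-shards, one worker each.
# Each worker holds a partial histogram for its (row-shard, column-shard).
hists = {
    ("r0", "c0"): np.array([1, 2, 0]),
    ("r1", "c0"): np.array([0, 1, 3]),
    ("r0", "c1"): np.array([2, 0, 1]),
    ("r1", "c1"): np.array([1, 1, 1]),
}
col_groups = {"c0": [("r0", "c0"), ("r1", "c0")],
              "c1": [("r0", "c1"), ("r1", "c1")]}

# Step 1 (row partitioning): workers holding the same columns
# sum-allReduce their histograms over all row-shards.
full_hists = {}
for col, members in col_groups.items():
    reduced = sum_allreduce([hists[m] for m in members])
    for m, h in zip(members, reduced):
        full_hists[m] = h

# Step 2 (column partitioning): each column group computes its local best
# split from the full histogram (toy gain = largest bin count), then a
# max-allReduce across column groups yields the global best split.
local_best = []
for col, members in col_groups.items():
    h = full_hists[members[0]]
    local_best.append((int(h.max()), col))
global_best = max_allreduce(local_best)[0]
print(global_best)  # → (3, 'c0')
```

In a real implementation the two reductions would be NCCL/comms collectives over worker subgroups rather than local loops, but the data flow is the same: sum over row-shards first, then max over column-shards.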