
[FEA] Support for other ways to do MNMG-RF #3539

Open
@teju85

Is your feature request related to a problem? Please describe.
The current MNMG (multi-node multi-GPU) RF is more like a model-parallel approach. We distribute the data among the workers and also split the work of building the trees among them. Each worker then builds its trees using only the data available to it locally.

Although this is an embarrassingly parallel way to build the trees in an RF, it has some limitations:

  1. It does not work well if the dataset is wide (i.e. has many features).
  2. A tree built on a particular worker does not see samples held by other workers, which could introduce bias.
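
To make limitation 2 concrete, here is a minimal pure-Python sketch of the current scheme as described above. It is not cuML code: `partition_rows` and `trees_per_worker` are hypothetical helpers, and the contiguous row split is an assumption made only for illustration.

```python
def partition_rows(n_rows, n_workers):
    """Split row indices into contiguous chunks, one chunk per worker."""
    base, extra = divmod(n_rows, n_workers)
    parts, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts


def trees_per_worker(n_trees, n_workers):
    """Divide the forest evenly; each worker builds its share independently."""
    base, extra = divmod(n_trees, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


# Toy labels that happen to be sorted, so the partitions are skewed.
labels = [0] * 6 + [1] * 2
parts = partition_rows(len(labels), n_workers=2)

# Fraction of positive labels seen by each worker versus the whole dataset.
local_pos_rate = [sum(labels[i] for i in p) / len(p) for p in parts]
global_pos_rate = sum(labels) / len(labels)
```

In this toy case the trees on worker 0 never see a positive sample, while the trees on worker 1 see twice the global positive rate.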

Describe the solution you'd like
Alongside the current approach, we should give users the option of a second approach, which works as follows:

  1. If the rows of the dataset are distributed across the workers, then we need to perform an allReduce of the intermediate histograms among those workers, before computing the best split.
  2. If the columns of the dataset are distributed across the workers, then we need to perform a max-allReduce of the individual best-splits among those workers to get the “global” best split.
  3. If both rows and columns are distributed (aka 2D-partitioning of the dataset), then we need to do both 1 and 2.
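
The three cases above can be sketched roughly as follows. This is a single-process simulation in plain Python, not the proposed implementation: `allreduce_sum` and `allreduce_max` are hypothetical stand-ins for the real collectives, features are assumed to be pre-binned, and the gain is a simple variance-reduction proxy for regression chosen only for illustration.

```python
def local_histogram(bins, labels, n_bins):
    """Per-bin [count, label_sum] for one feature over one worker's rows."""
    hist = [[0, 0.0] for _ in range(n_bins)]
    for b, y in zip(bins, labels):
        hist[b][0] += 1
        hist[b][1] += y
    return hist


def allreduce_sum(hists):
    """Stand-in for a sum-allReduce: elementwise sum of the workers' histograms."""
    n_bins = len(hists[0])
    return [[sum(h[b][0] for h in hists), sum(h[b][1] for h in hists)]
            for b in range(n_bins)]


def best_split(hist, feature):
    """Return (gain, feature, bin) for the best 'value <= bin' split of a histogram."""
    total_n = sum(c for c, _ in hist)
    total_s = sum(s for _, s in hist)
    best = (float("-inf"), feature, -1)
    n_l, s_l = 0, 0.0
    for b in range(len(hist) - 1):
        n_l += hist[b][0]
        s_l += hist[b][1]
        n_r, s_r = total_n - n_l, total_s - s_l
        if n_l == 0 or n_r == 0:
            continue  # a split must leave samples on both sides
        gain = s_l * s_l / n_l + s_r * s_r / n_r - total_s * total_s / total_n
        best = max(best, (gain, feature, b))
    return best


def allreduce_max(local_bests):
    """Stand-in for a max-allReduce: keep the split with the highest gain."""
    return max(local_bests)


# Case 1: rows of feature 0 are split across two workers.
hist_a = local_histogram([0, 0, 1], [0, 0, 1], n_bins=3)
hist_b = local_histogram([1, 2, 2], [1, 1, 1], n_bins=3)
global_hist = allreduce_sum([hist_a, hist_b])
split_f0 = best_split(global_hist, feature=0)

# Case 2: columns are split across workers; each proposes its own best split.
split_f1 = best_split([[3, 2.0], [3, 2.0]], feature=1)
global_best = allreduce_max([split_f0, split_f1])
```

For 2D partitioning (case 3), the histogram reduction would run among the workers that share a column block, and the max reduction would then run across column blocks.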
