[SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements #2125

Closed
wants to merge 51 commits into master from jkbradley:dt-opt3alt

Conversation

jkbradley

Summary:

  1. Variable numBins for each feature [SPARK-3043]
  2. Reduced data reshaping in aggregation [SPARK-3043]
  3. Choose ordering for ordered categorical features adaptively [SPARK-3156]
  4. Changed nodes to use 1-indexing [SPARK-3086]
  5. Small clean-ups

Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).

Speedups: This update helps most when many features use few bins but a few features use many bins. Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):

  • Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
  • Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec

Details:

(1) Variable numBins for each feature [SPARK-3043]

DecisionTreeMetadata now computes a variable numBins for each feature. It also tracks numSplits.
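
To make the bin counts concrete, here is a minimal sketch following the per-feature sizes described later in this PR (continuous: maxBins; ordered categorical: one bin per category; unordered categorical: 2 * (2^(arity-1) - 1)). The helper name and the arity-0-means-continuous convention are illustrative, not the actual DecisionTreeMetadata API:

```scala
// Hypothetical helper showing the per-feature bin counts described in this PR.
def numBinsForFeature(arity: Int, isUnordered: Boolean, maxBins: Int): Int =
  if (arity == 0) maxBins                            // continuous (arity 0 by convention here)
  else if (isUnordered) 2 * ((1 << (arity - 1)) - 1) // unordered categorical
  else arity                                         // ordered categorical
```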

(2) Reduced data reshaping in aggregation [SPARK-3043]

Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.

  • Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
  • Design note: I originally tried creating Impurity classes that stored data, and storing the aggregates in an Array[Array[Array[Impurity]]]. However, this led to significant slowdowns, perhaps because of the overhead of creating so many objects.

The aggregate statistics are never reshaped, and cumulative sums are computed in-place.
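
Here is a hedged sketch of the flat-indexing idea (illustrative names, not the actual DTStatsAggregator API): all bin statistics live in one flat array, each (feature, bin) pair owns a contiguous slice of statsSize values, and the cumulative sums run in place over that slice.

```scala
// Sketch only: one flat Array[Double] holds all (feature, bin, stat) entries,
// so updates and prefix sums never reshape the data.
class FlatStatsAggregator(numBinsPerFeature: Array[Int], statsSize: Int) {
  // featureOffsets(f) = start index of feature f's bins within `stats`
  private val featureOffsets: Array[Int] =
    numBinsPerFeature.scanLeft(0)((offset, nBins) => offset + nBins * statsSize)

  val stats: Array[Double] = new Array[Double](featureOffsets.last)

  // Add `value` to one statistic of one (feature, bin) cell.
  def update(featureIndex: Int, binIndex: Int, statIndex: Int, value: Double): Unit =
    stats(featureOffsets(featureIndex) + binIndex * statsSize + statIndex) += value

  // In-place prefix sums over one feature's bins: bin b ends up holding the
  // totals for bins 0..b (left-of-split statistics), with no copying.
  def cumulativeSums(featureIndex: Int): Unit = {
    val offset = featureOffsets(featureIndex)
    var bin = 1
    while (bin < numBinsPerFeature(featureIndex)) {
      var s = 0
      while (s < statsSize) {
        stats(offset + bin * statsSize + s) += stats(offset + (bin - 1) * statsSize + s)
        s += 1
      }
      bin += 1
    }
  }
}
```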

Updated the layout of aggregation functions. The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing; a sketch of the ordered case follows the list below.
For this update, the following functions were refactored:

  • updateBinForOrderedFeature
  • updateBinForUnorderedFeature
  • binaryOrNotCategoricalBinSeqOp
  • multiclassWithCategoricalBinSeqOp
  • regressionBinSeqOp
    The above 5 functions were replaced with:
  • orderedBinSeqOp
  • someUnorderedBinSeqOp
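
A rough sketch of the shape of the ordered-feature seqOp, reusing the FlatStatsAggregator sketch above (illustrative; the real orderedBinSeqOp updates full impurity statistics through DTStatsAggregator):

```scala
// For one example: bump each feature's statistics at the example's binned value.
// Assumes statsSize >= 2 in the aggregator sketch above.
def orderedBinSeqOpSketch(agg: FlatStatsAggregator,
                          binnedFeatures: Array[Int],
                          label: Double): Unit = {
  var f = 0
  while (f < binnedFeatures.length) {
    agg.update(f, binnedFeatures(f), 0, 1.0)   // stat slot 0: example count
    agg.update(f, binnedFeatures(f), 1, label) // stat slot 1: label sum
    f += 1
  }
}
```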

Other changes:

  • calculateGainForSplit now treats all feature types the same way.
  • Eliminated extractLeftRightNodeAggregates.

(3) Choose ordering for ordered categorical features adaptively [SPARK-3156]

Updated binsToBestSplit():

  • This now computes cumulative sums of stats for ordered features.
  • For ordered categorical features, it chooses an ordering for categories. (This used to be done by findSplitsBins; a sketch follows this list.)
  • Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].
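
A hypothetical sketch of the adaptive ordering: rank categories by a centroid statistic (the mean label here), so only numCategories - 1 contiguous splits of the ranking need to be checked. binsToBestSplit does this on the aggregated bin statistics; the names below are illustrative:

```scala
// Returns category indices sorted by mean label (a centroid statistic).
def orderCategoriesByCentroid(labelSum: Array[Double], count: Array[Long]): Array[Int] =
  labelSum.indices
    .sortBy(cat => if (count(cat) == 0) 0.0 else labelSum(cat) / count(cat))
    .toArray
```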

Side effects:

  • In findSplitsBins: A sample of the data is only taken for data with continuous features. It is not needed for data with only categorical features.
  • In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
  • TreePoint binning is simpler for categorical features.

(4) Changed nodes to use 1-indexing [SPARK-3086]

Nodes used to be indexed from 0. Now they are indexed from 1.
Node indexing functions are now collected in object Node (Node.scala).
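
Why 1-indexing helps: parent/child moves become simple bit shifts. A small sketch of the kind of helpers collected in object Node (illustrative names; see Node.scala for the actual methods):

```scala
object NodeIndexing {
  def leftChildIndex(nodeIndex: Int): Int = nodeIndex << 1         // 2 * i
  def rightChildIndex(nodeIndex: Int): Int = (nodeIndex << 1) + 1  // 2 * i + 1
  def parentIndex(nodeIndex: Int): Int = nodeIndex >> 1            // i / 2
  def indexToLevel(nodeIndex: Int): Int =                          // root (index 1) is level 0
    31 - java.lang.Integer.numberOfLeadingZeros(nodeIndex)
}
```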

(5) Small clean-ups

Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
Eliminated InvalidBinIndex since it is no longer used.

CC: @mengxr @manishamde Please let me know if you have thoughts on this—thanks!

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
* Added TreePoint representation to avoid calling findBin multiple times.
* (not working yet, but debugging)
Optimization: Added TreePoint representation so we only call findBin once per (example, feature) pair.

Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones.

BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if they were indexed by feature values, rather than by subsets of values/categories.
* The bug was exhibited for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features (see the sketch below).
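
A minimal sketch of the corrected convention, with illustrative names (the goesLeft helper and the bit encoding are assumptions in the spirit of the fix, not the exact code):

```scala
// For an unordered categorical feature, split/bin index j encodes a subset of
// categories in its bits; an example with category `cat` goes to the left
// child iff bit `cat` of j is set.
def goesLeft(subsetIndex: Int, cat: Int): Boolean = ((subsetIndex >> cat) & 1) == 1
```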

Also: some commented-out debugging println calls in DecisionTree, to be removed later
DecisionTree.scala
* Eliminated filters, replaced by building tree on the fly and filtering top-down.
** Aggregation over examples now skips examples which do not reach the current level.
* Only calculate unorderedFeatures once (in findSplitsBins)

Node: Renamed predictIfLeaf to predict

Bin, Split: Updated doc
* Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair.
* Internal doc: findAggForOrderedFeatureClassification
* Created ImpurityAggregator classes, rather than old aggregates.
* Feature split/bin semantics are based on ordered vs. unordered
** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features.
* numBins can differ for each feature

DecisionTree
* Major changes based on new aggregator setup
** For ordered features, aggregate is indexed by: (nodeIndex)(featureIndex)(binIndex).
** For unordered features, aggregate is indexed by: (nodeIndex)(featureIndex)(2 * binIndex).
* Added LearningMetadata class
* Eliminated now-unused functions:
** extractNodeInfo
** getNumSplitsForFeature
** getBinDataForNode (Eliminated since it merely slices/reshapes data.)

ImpurityAggregator classes
* Changed main aggregate operation to create binAggregates (binSeqOp, binCompOp) to use the aggregator.
* Before, for unordered features, the left/right bins were treated as a separate dimension for aggregates.  They are now part of the bins: binAggregates is of size: (numNodes, numBins_f, numFeatures) where numBins_f is:
** 2 * [pow(2, maxFeatureValue - 1) - 1] for unordered categorical features
** maxFeatureValue for ordered categorical features
** maxBins for continuous features

DecisionTreeSuite
* For tests using unordered (low-arity) features, removed checks of Bin.category, which only has meaning for ordered features.
…s incorrect since counts were swapped accidentally
…l features

* Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits.
* (Temp) Fix: Treat low-arity categorical features as unordered for binary classification.
* Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features.
* Status: The bug may be fixed, but more testing needs to be done.

Aggregates: The same binMultiplier (for ordered vs. unordered) was applied to all features.  It is now applied on a per-feature basis.
Removed debugging println calls from DecisionTree. Made TreePoint extend Serializable
* Updated doc
* Made some methods private

Changed timer to report time in seconds.
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
	mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala
	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@jkbradley

@mengxr I agree memory limits are a problem.

  1. I am OK with changing maxBins to 32. For maxMemoryInMB, does 256 seem reasonable when the default executor memory is 512?
  2. For choosing the best splits, I agree we could distribute it, but I'm not sure about the gains. The memory requirements might be hard to reduce given the current setup, since each executor needs to update bins for all nodes during its one pass over its data points. If we maintained a mapping from nodes to relevant examples, then I could imagine spawning one job per node; that would reduce the memory requirement but might mean many more passes over the data. As far as the driver's load goes, fairly little time is spent outside of aggregation, so I don't think it's a big issue. Am I misunderstanding, though?

I could imagine 2 main ways to reduce memory usage without doing more passes over the data:
(a) Simple way: We could use different types to compress data, as others have done.
(b) Fancy way: We could use many fewer maxBins by default but re-bin features at each node to ameliorate the effects of a small number of bins. E.g., the top node might split a continuous feature into bins [0, 20], [20, 40], ... and choose to split at 20; in the next level, the left node might use bins [0, 5], [5, 10], ... [15, 20], and the right node might use bins [20, 30], [30, 40], ....
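
A toy sketch of idea (b), which is a proposal in this discussion rather than code in the PR (the rebin helper is hypothetical): recompute equal-width thresholds over the surviving range of a continuous feature at each node, so a small maxBins still gives fine resolution deeper in the tree.

```scala
// Interior thresholds splitting [rangeMin, rangeMax] into numBins equal-width bins.
def rebin(rangeMin: Double, rangeMax: Double, numBins: Int): Array[Double] = {
  val width = (rangeMax - rangeMin) / numBins
  Array.tabulate(numBins - 1)(i => rangeMin + width * (i + 1))
}
// Matching the example above: rebin(0, 100, 5) gives thresholds 20, 40, 60, 80;
// after splitting at 20, the left child calls rebin(0, 20, 4) for 5, 10, 15.
```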


SparkQA commented Sep 3, 2014

QA tests have started for PR 2125 at commit aa4e4df.

  • This patch merges cleanly.


SparkQA commented Sep 3, 2014

QA tests have finished for PR 2125 at commit aa4e4df.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • logDebug("isMulticlass = " + metadata.isMulticlass)

@jkbradley

Jenkins, please test this again.


mengxr commented Sep 5, 2014

retest this please


mengxr commented Sep 5, 2014

Jenkins, retest this please.



SparkQA commented Sep 5, 2014

QA tests have started for PR 2125 at commit a2acea5.

  • This patch merges cleanly.


SparkQA commented Sep 6, 2014

QA tests have finished for PR 2125 at commit a2acea5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley

Looks like other tests are failing. I might as well push one more tweak (a speedup based on profiling).


SparkQA commented Sep 6, 2014

QA tests have started for PR 2125 at commit 42c192a.

  • This patch merges cleanly.


SparkQA commented Sep 6, 2014

QA tests have finished for PR 2125 at commit 42c192a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • logDebug("isMulticlass = " + metadata.isMulticlass)


mengxr commented Sep 8, 2014

LGTM. Merged into master. Thanks!

asfgit closed this in 711356b on Sep 8, 2014
asfgit pushed a commit that referenced this pull request Sep 9, 2014
Adjust the default values of decision tree, based on the memory requirement discussed in #2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)


Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
@jkbradley jkbradley deleted the dt-opt3alt branch October 8, 2014 21:21
andmarios pushed a commit to andmarios/spark-examples that referenced this pull request Mar 20, 2015