[SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements #2125

Closed
wants to merge 51 commits into master from jkbradley:dt-opt3alt

Conversation

jkbradley

Summary:

  1. Variable numBins for each feature [SPARK-3043]
  2. Reduced data reshaping in aggregation [SPARK-3043]
  3. Choose ordering for ordered categorical features adaptively [SPARK-3156]
  4. Changed nodes to use 1-indexing [SPARK-3086]
  5. Small clean-ups

Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).

Speedups: This update helps most when many features use few bins but a few features use many bins. Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):

  • Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
  • Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec

Details:

(1) Variable numBins for each feature [SPARK-3043]

DecisionTreeMetadata now computes a variable numBins for each feature. It also tracks numSplits.
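
To make the bin counts concrete, here is a minimal sketch following the per-feature sizes described later in this PR (continuous: maxBins; ordered categorical: one bin per category; unordered categorical: 2 * (2^(arity-1) - 1)). The helper name and the arity-0-means-continuous convention are illustrative, not the actual DecisionTreeMetadata API:

```scala
// Hypothetical helper showing the per-feature bin counts described in this PR.
def numBinsForFeature(arity: Int, isUnordered: Boolean, maxBins: Int): Int =
  if (arity == 0) maxBins                            // continuous (arity 0 by convention here)
  else if (isUnordered) 2 * ((1 << (arity - 1)) - 1) // unordered categorical
  else arity                                         // ordered categorical
```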

(2) Reduced data reshaping in aggregation [SPARK-3043]

Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.

  • Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
  • Design note: I originally tried creating Impurity classes that stored data, and storing the aggregates in an Array[Array[Array[Impurity]]]. However, this led to significant slowdowns, perhaps because of the overhead of creating so many objects.

The aggregate statistics are never reshaped, and cumulative sums are computed in-place.
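
Here is a hedged sketch of the flat-indexing idea (illustrative names, not the actual DTStatsAggregator API): all bin statistics live in one flat array, each (feature, bin) pair owns a contiguous slice of statsSize values, and the cumulative sums run in place over that slice.

```scala
// Sketch only: one flat Array[Double] holds all (feature, bin, stat) entries,
// so updates and prefix sums never reshape the data.
class FlatStatsAggregator(numBinsPerFeature: Array[Int], statsSize: Int) {
  // featureOffsets(f) = start index of feature f's bins within `stats`
  private val featureOffsets: Array[Int] =
    numBinsPerFeature.scanLeft(0)((offset, nBins) => offset + nBins * statsSize)

  val stats: Array[Double] = new Array[Double](featureOffsets.last)

  // Add `value` to one statistic of one (feature, bin) cell.
  def update(featureIndex: Int, binIndex: Int, statIndex: Int, value: Double): Unit =
    stats(featureOffsets(featureIndex) + binIndex * statsSize + statIndex) += value

  // In-place prefix sums over one feature's bins: bin b ends up holding the
  // totals for bins 0..b (left-of-split statistics), with no copying.
  def cumulativeSums(featureIndex: Int): Unit = {
    val offset = featureOffsets(featureIndex)
    var bin = 1
    while (bin < numBinsPerFeature(featureIndex)) {
      var s = 0
      while (s < statsSize) {
        stats(offset + bin * statsSize + s) += stats(offset + (bin - 1) * statsSize + s)
        s += 1
      }
      bin += 1
    }
  }
}
```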

Updated the layout of aggregation functions. The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing; a sketch of the ordered case follows the list below.
For this update, the following functions were refactored:

  • updateBinForOrderedFeature
  • updateBinForUnorderedFeature
  • binaryOrNotCategoricalBinSeqOp
  • multiclassWithCategoricalBinSeqOp
  • regressionBinSeqOp
    The above 5 functions were replaced with:
  • orderedBinSeqOp
  • someUnorderedBinSeqOp
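
A rough sketch of the shape of the ordered-feature seqOp, reusing the FlatStatsAggregator sketch above (illustrative; the real orderedBinSeqOp updates full impurity statistics through DTStatsAggregator):

```scala
// For one example: bump each feature's statistics at the example's binned value.
// Assumes statsSize >= 2 in the aggregator sketch above.
def orderedBinSeqOpSketch(agg: FlatStatsAggregator,
                          binnedFeatures: Array[Int],
                          label: Double): Unit = {
  var f = 0
  while (f < binnedFeatures.length) {
    agg.update(f, binnedFeatures(f), 0, 1.0)   // stat slot 0: example count
    agg.update(f, binnedFeatures(f), 1, label) // stat slot 1: label sum
    f += 1
  }
}
```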

Other changes:

  • calculateGainForSplit now treats all feature types the same way.
  • Eliminated extractLeftRightNodeAggregates.

(3) Choose ordering for ordered categorical features adaptively [SPARK-3156]

Updated binsToBestSplit():

  • This now computes cumulative sums of stats for ordered features.
  • For ordered categorical features, it chooses an ordering for categories. (This used to be done by findSplitsBins; a sketch follows this list.)
  • Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].
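
A hypothetical sketch of the adaptive ordering: rank categories by a centroid statistic (the mean label here), so only numCategories - 1 contiguous splits of the ranking need to be checked. binsToBestSplit does this on the aggregated bin statistics; the names below are illustrative:

```scala
// Returns category indices sorted by mean label (a centroid statistic).
def orderCategoriesByCentroid(labelSum: Array[Double], count: Array[Long]): Array[Int] =
  labelSum.indices
    .sortBy(cat => if (count(cat) == 0) 0.0 else labelSum(cat) / count(cat))
    .toArray
```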

Side effects:

  • In findSplitsBins: A sample of the data is only taken for data with continuous features. It is not needed for data with only categorical features.
  • In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
  • TreePoint binning is simpler for categorical features.

(4) Changed nodes to use 1-indexing [SPARK-3086]

Nodes used to be indexed from 0. Now they are indexed from 1.
Node indexing functions are now collected in object Node (Node.scala).
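
Why 1-indexing helps: parent/child moves become simple bit shifts. A small sketch of the kind of helpers collected in object Node (illustrative names; see Node.scala for the actual methods):

```scala
object NodeIndexing {
  def leftChildIndex(nodeIndex: Int): Int = nodeIndex << 1         // 2 * i
  def rightChildIndex(nodeIndex: Int): Int = (nodeIndex << 1) + 1  // 2 * i + 1
  def parentIndex(nodeIndex: Int): Int = nodeIndex >> 1            // i / 2
  def indexToLevel(nodeIndex: Int): Int =                          // root (index 1) is level 0
    31 - java.lang.Integer.numberOfLeadingZeros(nodeIndex)
}
```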

(5) Small clean-ups

Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
Eliminated InvalidBinIndex since it is no longer used.

CC: @mengxr @manishamde Please let me know if you have thoughts on this—thanks!

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
* Added TreePoint representation to avoid calling findBin multiple times.
* (not working yet, but debugging)
Optimization: Added TreePoint representation so we only call findBin once per (example, feature) pair.

Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones.

BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if they were indexed by feature values, rather than by subsets of values/categories.
* The bug was exhibited for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features (see the sketch below).
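
A minimal sketch of the corrected convention, with illustrative names (the goesLeft helper and the bit encoding are assumptions in the spirit of the fix, not the exact code):

```scala
// For an unordered categorical feature, split/bin index j encodes a subset of
// categories in its bits; an example with category `cat` goes to the left
// child iff bit `cat` of j is set.
def goesLeft(subsetIndex: Int, cat: Int): Boolean = ((subsetIndex >> cat) & 1) == 1
```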

Also: some commented-out debugging println calls in DecisionTree, to be removed later
DecisionTree.scala
* Eliminated filters, replaced by building tree on the fly and filtering top-down.
** Aggregation over examples now skips examples which do not reach the current level.
* Only calculate unorderedFeatures once (in findSplitsBins)

Node: Renamed predictIfLeaf to predict

Bin, Split: Updated doc
* Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair.
* Internal doc: findAggForOrderedFeatureClassification
* Created ImpurityAggregator classes, rather than old aggregates.
* Feature split/bin semantics are based on ordered vs. unordered
** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features.
* numBins can differ for each feature

DecisionTree
* Major changes based on new aggregator setup
** For ordered features, aggregate is indexed by: (nodeIndex)(featureIndex)(binIndex).
** For unordered features, aggregate is indexed by: (nodeIndex)(featureIndex)(2 * binIndex).
* Added LearningMetadata class
* Eliminated now-unused functions:
** extractNodeInfo
** getNumSplitsForFeature
** getBinDataForNode (Eliminated since it merely slices/reshapes data.)

ImpurityAggregator classes
* Changed main aggregate operation to create binAggregates (binSeqOp, binCompOp) to use the aggregator.
* Before, for unordered features, the left/right bins were treated as a separate dimension for aggregates.  They are now part of the bins: binAggregates is of size: (numNodes, numBins_f, numFeatures) where numBins_f is:
** 2 * [pow(2, maxFeatureValue - 1) - 1] for unordered categorical features
** maxFeatureValue for ordered categorical features
** maxBins for continuous features

DecisionTreeSuite
* For tests using unordered (low-arity) features, removed checks of Bin.category, which only has meaning for ordered features.
…s incorrect since counts were swapped accidentally
…l features

* Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits.
* (Temp) Fix: Treat low-arity categorical features as unordered for binary classification.
* Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features.
* Status: The bug may be fixed, but more testing needs to be done.

Aggregates: The same binMultiplier (for ordered vs. unordered) was applied to all features.  It is now applied on a per-feature basis.
Removed debugging println calls from DecisionTree. Made TreePoint extend Serializable
* Updated doc
* Made some methods private

Changed timer to report time in seconds.
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
	mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala
	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@jkbradley

@mengxr I agree memory limits are a problem.

  1. I am OK with changing maxBins to 32. For maxMemoryInMB, does 256 seem reasonable when the default executor memory is 512?
  2. For choosing the best splits, I agree we could distribute it, but I'm not sure about the gains. The memory requirements might be hard to reduce given the current setup, since each executor needs to update bins for all nodes during its one pass over its data points. If we maintained a mapping from nodes to relevant examples, then I could imagine spawning one job per node; that would reduce the memory requirement but might mean many more passes over the data. As far as the driver's load goes, fairly little time is spent outside of aggregation, so I don't think it's a big issue. Am I misunderstanding, though?

I could imagine 2 main ways to reduce memory usage without doing more passes over the data:
(a) Simple way: We could use different types to compress data, as others have done.
(b) Fancy way: We could use many fewer maxBins by default but re-bin features at each node to ameliorate the effects of a small number of bins. E.g., the top node might split a continuous feature into bins [0, 20], [20, 40], ... and choose to split at 20; in the next level, the left node might use bins [0, 5], [5, 10], ... [15, 20], and the right node might use bins [20, 30], [30, 40], ....
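
A toy sketch of idea (b), which is a proposal in this discussion rather than code in the PR (the rebin helper is hypothetical): recompute equal-width thresholds over the surviving range of a continuous feature at each node, so a small maxBins still gives fine resolution deeper in the tree.

```scala
// Interior thresholds splitting [rangeMin, rangeMax] into numBins equal-width bins.
def rebin(rangeMin: Double, rangeMax: Double, numBins: Int): Array[Double] = {
  val width = (rangeMax - rangeMin) / numBins
  Array.tabulate(numBins - 1)(i => rangeMin + width * (i + 1))
}
// Matching the example above: rebin(0, 100, 5) gives thresholds 20, 40, 60, 80;
// after splitting at 20, the left child calls rebin(0, 20, 4) for 5, 10, 15.
```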


SparkQA commented Sep 3, 2014

QA tests have started for PR 2125 at commit aa4e4df.

  • This patch merges cleanly.


SparkQA commented Sep 3, 2014

QA tests have finished for PR 2125 at commit aa4e4df.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • logDebug("isMulticlass = " + metadata.isMulticlass)

@jkbradley

Jenkins, please test this again.


mengxr commented Sep 5, 2014

retest this please


mengxr commented Sep 5, 2014

Jenkins, retest this please.



SparkQA commented Sep 5, 2014

QA tests have started for PR 2125 at commit a2acea5.

  • This patch merges cleanly.


SparkQA commented Sep 6, 2014

QA tests have finished for PR 2125 at commit a2acea5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley

Looks like other tests are failing. I might as well push one more tweak (a speedup based on profiling).


SparkQA commented Sep 6, 2014

QA tests have started for PR 2125 at commit 42c192a.

  • This patch merges cleanly.


SparkQA commented Sep 6, 2014

QA tests have finished for PR 2125 at commit 42c192a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • logDebug("isMulticlass = " + metadata.isMulticlass)


mengxr commented Sep 8, 2014

LGTM. Merged into master. Thanks!

asfgit closed this in 711356b on Sep 8, 2014
asfgit pushed a commit that referenced this pull request Sep 9, 2014
Adjust the default values of decision tree, based on the memory requirement discussed in #2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)


Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
@jkbradley jkbradley deleted the dt-opt3alt branch October 8, 2014 21:21
andmarios pushed a commit to andmarios/spark-examples that referenced this pull request Mar 20, 2015