[SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements #2125
Conversation
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
* Added TreePoint representation to avoid calling findBin multiple times.
* (not working yet, but debugging)
Optimization: Added TreePoint representation so we only call findBin once per (example, feature) pair. Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones.

BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if indexed by feature values, rather than by subsets of values/categories.
* Exhibited for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.

Also: some commented-out debugging println calls in DecisionTree, to be removed later.
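The TreePoint idea can be sketched roughly as follows. This is a minimal illustration, not the actual mllib code: `findBin` here is a simple threshold scan standing in for the real binning logic, and the helper names are made up. The point is that bin indices are computed once per (example, feature) and cached, so later aggregation passes never re-bin raw feature values.

```scala
// Hypothetical sketch of the TreePoint representation: each example's label
// plus its precomputed bin index for every feature.
case class TreePoint(label: Double, binnedFeatures: Array[Int])

object TreePoint {
  // Stand-in for the real findBin: return the index of the first threshold
  // the value falls at or below, or the last bin if it exceeds all thresholds.
  def findBin(value: Double, thresholds: Array[Double]): Int = {
    val i = thresholds.indexWhere(value <= _)
    if (i == -1) thresholds.length else i
  }

  // Convert a raw example once; aggregation then works on bin indices only.
  def convert(label: Double, features: Array[Double],
              thresholds: Array[Array[Double]]): TreePoint = {
    val bins = Array.tabulate(features.length)(f => findBin(features(f), thresholds(f)))
    TreePoint(label, bins)
  }
}
```
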
DecisionTree.scala
* Eliminated filters, replaced by building the tree on the fly and filtering top-down.
** Aggregation over examples now skips examples which do not reach the current level.
* Only calculate unorderedFeatures once (in findSplitsBins).

Node: Renamed predictIfLeaf to predict.
Bin, Split: Updated doc.
* Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair.
* Internal doc: findAggForOrderedFeatureClassification
* Created ImpurityAggregator classes, rather than old aggregates.
* Feature split/bin semantics are based on ordered vs. unordered features:
** E.g., numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features.
* numBins can differ for each feature.

DecisionTree
* Major changes based on the new aggregator setup:
** For ordered features, the aggregate is indexed by (nodeIndex)(featureIndex)(binIndex).
** For unordered features, the aggregate is indexed by (nodeIndex)(featureIndex)(2 * binIndex).
* Added LearningMetadata class.
* Eliminated now-unused functions:
** extractNodeInfo
** getNumSplitsForFeature
** getBinDataForNode (eliminated since it merely slices/reshapes data)

ImpurityAggregator classes
* Changed the main aggregate operation that creates binAggregates (binSeqOp, binCombOp) to use the aggregator.
* Before, for unordered features, the left/right bins were treated as a separate dimension for aggregates. They are now part of the bins: binAggregates has size (numNodes, numBins_f, numFeatures), where numBins_f is:
** 2 * [pow(2, maxFeatureValue - 1) - 1] for unordered categorical features
** maxFeatureValue for ordered categorical features
** maxBins for continuous features

DecisionTreeSuite
* For tests using unordered (low-arity) features, removed checks of Bin.category, which only has meaning for ordered features.
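The per-feature bin-count rules above can be written down directly. This is a hedged sketch (the function name is made up for illustration): an unordered categorical feature of arity k has pow(2, k - 1) - 1 distinct subset splits, each needing a left and a right bin, hence the factor of 2.

```scala
// Sketch of the numBins_f rules from the commit message (name is hypothetical).
def numBinsForFeature(isCategorical: Boolean, isUnordered: Boolean,
                      arity: Int, maxBins: Int): Int = {
  if (!isCategorical) maxBins                       // continuous features
  else if (isUnordered) 2 * ((1 << (arity - 1)) - 1) // left + right bin per subset split
  else arity                                         // ordered categorical features
}
```
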
…s incorrect since counts were swapped accidentally
…l features
* Bug: Categorical features were all treated as ordered for binary classification. This is possible but would require the bin ordering to be determined on the fly after the aggregation. Currently, the ordering is determined a priori and fixed for all splits.
* (Temp) Fix: Treat low-arity categorical features as unordered for binary classification.
* Related change: I removed most tests for isMulticlass in the code; I instead test metadata for whether there are unordered features.
* Status: The bug may be fixed, but more testing needs to be done.

Aggregates: The same binMultiplier (for ordered vs. unordered) was applied to all features. It is now applied on a per-feature basis.
…emoved debugging println calls from DecisionTree. Made TreePoint extend Serializable
Conflicts:
  mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
  mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
…ng println calls in DecisionTree.scala.
Conflicts:
  mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
  mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala
  mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
…eaned up DT Suite some.
…xing functions to Node.
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@mengxr I agree memory limits are a problem.
I could imagine two main ways to reduce memory usage without doing more passes over the data:
…iplied by statsSize)
QA tests have started for PR 2125 at commit
QA tests have finished for PR 2125 at commit
Jenkins, please test this again.
retest this please
Jenkins, retest this please.
QA tests have started for PR 2125 at commit
QA tests have finished for PR 2125 at commit
Looks like other tests are failing. I might as well push one more tweak (speedup tweak based on profiling).
QA tests have started for PR 2125 at commit
QA tests have finished for PR 2125 at commit
LGTM. Merged into master. Thanks!
Adjust the default values of decision tree, based on the memory requirement discussed in #2125:
1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree
Summary:
Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).
Speedups: This update helps most when many features use few bins but a few features use many bins. Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):
Details:
(1) Variable numBins for each feature [SPARK-3043]
DecisionTreeMetadata now computes a variable numBins for each feature. It also tracks numSplits.
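As a rough illustration of what a variable per-feature bin count buys: a continuous feature can use up to maxBins bins, while a categorical feature of arity k needs at most k, so low-arity features no longer pay for the global maximum. This sketch uses assumed names, not the actual DecisionTreeMetadata API.

```scala
// Hypothetical sketch: per-feature bin counts. featureArity maps a feature
// index to its number of categories; features absent from the map are
// treated as continuous and get the full maxBins.
def variableNumBins(featureArity: Map[Int, Int],
                    numFeatures: Int, maxBins: Int): Array[Int] =
  Array.tabulate(numFeatures) { f =>
    featureArity.get(f).map(k => math.min(k, maxBins)).getOrElse(maxBins)
  }
```

With many low-arity categorical features and a few continuous ones, the total aggregate size shrinks accordingly, which matches the speedup pattern described above (many features with few bins, a few with many).
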
(2) Reduced data reshaping in aggregation [SPARK-3043]
Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.
The aggregate statistics are never reshaped, and cumulative sums are computed in-place.
Updated the layout of aggregation functions. The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing.
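The flat-indexing idea behind DTStatsAggregator can be sketched as follows. This is a minimal stand-in with assumed names, not the real class: one flat array holds statsSize values per bin, with precomputed per-feature offsets, so updates and reads are plain offset arithmetic and the statistics never get reshaped.

```scala
// Minimal sketch of a flat stats aggregator (class and field names assumed).
// statsSize is the number of values tracked per bin (e.g. one count per class).
class FlatStatsAggregator(numBinsPerFeature: Array[Int], statsSize: Int) {
  // featureOffsets(f) = starting position of feature f's bins in allStats.
  private val featureOffsets: Array[Int] =
    numBinsPerFeature.scanLeft(0)((off, nBins) => off + nBins * statsSize)
  val allStats = new Array[Double](featureOffsets.last)

  // In-place update: no per-node or per-feature sub-arrays are allocated.
  def update(featureIndex: Int, binIndex: Int, label: Int, weight: Double): Unit =
    allStats(featureOffsets(featureIndex) + binIndex * statsSize + label) += weight

  def get(featureIndex: Int, binIndex: Int, label: Int): Double =
    allStats(featureOffsets(featureIndex) + binIndex * statsSize + label)
}
```
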
For this update, the following functions were refactored:
The above 5 functions were replaced with:
Other changes:
(3) Choose ordering for ordered categorical features adaptively [SPARK-3156]
Updated binsToBestSplit():
Side effects:
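One standard way to choose the ordering adaptively (the binary-classification case) is to sort categories by the average label of their examples after aggregation, so the best subset split can be found by scanning the sorted sequence. The sketch below uses made-up names and is only an illustration of that idea, not the PR's binsToBestSplit() code.

```scala
// Hypothetical sketch: order categories by centroid (mean label per category).
// labelSumPerCategory(c) = sum of labels of examples with category c;
// countPerCategory(c)    = number of such examples.
def orderCategories(labelSumPerCategory: Array[Double],
                    countPerCategory: Array[Double]): Array[Int] = {
  val centroids = labelSumPerCategory.indices.map { c =>
    val n = countPerCategory(c)
    (c, if (n == 0) 0.0 else labelSumPerCategory(c) / n)
  }
  centroids.sortBy(_._2).map(_._1).toArray
}
```

This determines the ordering after aggregation rather than a priori, which is what the bug fix in the earlier commit ("Categorical features were all treated as ordered for binary classification") called for.
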
(4) Changed nodes to use 1-indexing [SPARK-3086]
Nodes used to be indexed from 0. Now they are indexed from 1.
Node indexing functions are now collected in object Node (Node.scala).
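With 1-indexing, child and parent indices follow from simple arithmetic, which is presumably what the indexing helpers collected in object Node compute (a sketch; the exact helper names in Node.scala may differ):

```scala
// 1-indexed binary-heap-style node numbering: the root is node 1,
// node i's children are 2i and 2i + 1, and its parent is i / 2.
def leftChildIndex(i: Int): Int = i << 1
def rightChildIndex(i: Int): Int = (i << 1) + 1
def parentIndex(i: Int): Int = i >> 1
```

A convenient side effect of starting at 1 is that every node at depth d has an index in [2^d, 2^(d+1)), so depth and sibling position can also be read off the index.
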
(5) Small clean-ups
Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
Eliminated InvalidBinIndex since it is no longer used.
CC: @mengxr @manishamde Please let me know if you have thoughts on this—thanks!