
MLI-1 Decision Trees #79

Closed · wants to merge 51 commits

Commits (51)
cd53eae
skeletal framework
manishamde Nov 28, 2013
92cedce
basic building blocks for intermediate RDD calculation. untested.
manishamde Dec 2, 2013
8bca1e2
additional code for creating intermediate RDD
manishamde Dec 9, 2013
0012a77
basic stump working
manishamde Dec 10, 2013
03f534c
some more tests
manishamde Dec 10, 2013
dad0afc
decision stump functionality working
manishamde Dec 15, 2013
4798aae
added gain stats class
manishamde Dec 15, 2013
80e8c66
working version of multi-level split calculation
manishamde Dec 16, 2013
b0eb866
added logic to handle leaf nodes
manishamde Dec 16, 2013
98ec8d5
tree building and prediction logic
manishamde Dec 22, 2013
02c595c
added command line parsing
manishamde Dec 22, 2013
733d6dd
fixed tests
manishamde Dec 22, 2013
154aa77
enums for configurations
manishamde Dec 23, 2013
b0e3e76
adding enum for feature type
manishamde Jan 12, 2014
c8f6d60
adding enum for feature type
manishamde Jan 12, 2014
e23c2e5
added regression support
manishamde Jan 19, 2014
53108ed
fixing index for highest bin
manishamde Jan 20, 2014
6df35b9
regression predict logic
manishamde Jan 21, 2014
dbb7ac1
categorical feature support
manishamde Jan 23, 2014
d504eb1
more tests for categorical features
manishamde Jan 23, 2014
6b7de78
minor refactoring and tests
manishamde Jan 26, 2014
b09dc98
minor refactoring
manishamde Jan 26, 2014
c0e522b
updated predict and split threshold logic
manishamde Jan 27, 2014
f067d68
minor cleanup
manishamde Jan 27, 2014
5841c28
unit tests for categorical features
manishamde Jan 27, 2014
0dd7659
basic doc
manishamde Jan 27, 2014
dd0c0d7
minor: some docs
manishamde Jan 27, 2014
9372779
code style: max line length <= 100
manishamde Feb 17, 2014
84f85d6
code documentation
manishamde Feb 28, 2014
d3023b3
adding more docs for nested methods
manishamde Mar 6, 2014
63e786b
added multiple train methods for java compatibility
manishamde Mar 6, 2014
cd2c2b4
fixing code style based on feedback
manishamde Mar 7, 2014
eb8fcbe
minor code style updates
manishamde Mar 7, 2014
794ff4d
minor improvements to docs and style
manishamde Mar 10, 2014
d1ef4f6
more documentation
manishamde Mar 10, 2014
ad1fc21
incorporated mengxr's code style suggestions
manishamde Mar 11, 2014
62c2562
fixing comment indentation
manishamde Mar 11, 2014
6068356
ensuring num bins is always greater than max number of categories
manishamde Mar 12, 2014
2116360
removing dummy bin calculation for categorical variables
manishamde Mar 12, 2014
632818f
removing threshold for classification predict method
manishamde Mar 13, 2014
ff363a7
binary search for bins and while loop for categorical feature bins
manishamde Mar 17, 2014
4576b64
documentation and for to while loop conversion
manishamde Mar 23, 2014
24500c5
minor style updates
mengxr Mar 23, 2014
c487e6a
Merge pull request #1 from mengxr/dtree
manishamde Mar 23, 2014
f963ef5
making methods private
manishamde Mar 23, 2014
201702f
making some more methods private
manishamde Mar 23, 2014
62dc723
updating javadoc and converting helper methods to package private to …
manishamde Mar 24, 2014
e1dd86f
implementing code style suggestions
manishamde Mar 25, 2014
f536ae9
another pass on code style
mengxr Mar 31, 2014
7d54b4f
Merge pull request #4 from mengxr/dtree
manishamde Mar 31, 2014
1e8c704
remove numBins field in the Strategy class
manishamde Apr 1, 2014
1,150 changes: 1,150 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/README.md
@@ -0,0 +1,17 @@
This package contains the default implementation of the decision tree algorithm.

The decision tree algorithm supports:
+ Binary classification
+ Regression
+ Information loss calculation with entropy and Gini for classification and variance for regression
+ Both continuous and categorical features

# Tree improvements
+ Node model pruning
+ Printing to dot files

# Future Ensemble Extensions

+ Random forests
+ Boosting
+ Extremely randomized trees
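
To make the README concrete, here is a minimal training sketch against the API added in this PR. The Strategy constructor matches the diff below; the data loading, file name, and feature layout are illustrative assumptions rather than part of the PR:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.impurity.Gini

// Train a binary classification tree of depth at most 3 using Gini impurity.
val sc = new SparkContext("local", "DecisionTreeExample")
val data = sc.textFile("data.csv").map { line =>   // hypothetical input file
  val values = line.split(',').map(_.toDouble)
  LabeledPoint(values.head, values.tail)           // label followed by features
}
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 3)
val model = DecisionTree.train(data, strategy)
println(model.predict(Array(1.0, 0.0, 2.0)))       // predict on one feature vector
```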
26 changes: 26 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Algo.scala
@@ -0,0 +1,26 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.configuration

/**
* Enum to select the algorithm for the decision tree
*/
object Algo extends Enumeration {
  type Algo = Value
  val Classification, Regression = Value
}

Review comment: The Algo Enumeration seems redundant, given that Impurity implies the algorithm anyway.

Review comment: The various Enumeration classes in the mllib.tree.configuration package are neat. A uniform design pattern for parameters and options should be adopted across MLlib and Spark, and this could be a start. Alternatively, if there is an existing pattern in use, the decision tree should follow it as well.
26 changes: 26 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/FeatureType.scala
@@ -0,0 +1,26 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.configuration

/**
* Enum to describe whether a feature is "continuous" or "categorical"
*/
object FeatureType extends Enumeration {
  type FeatureType = Value
  val Continuous, Categorical = Value
}
26 changes: 26 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/QuantileStrategy.scala
@@ -0,0 +1,26 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.configuration

/**
* Enum for selecting the quantile calculation strategy
*/
object QuantileStrategy extends Enumeration {
  type QuantileStrategy = Value
  val Sort, MinMax, ApproxHist = Value
}
43 changes: 43 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
@@ -0,0 +1,43 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.configuration

import org.apache.spark.mllib.tree.impurity.Impurity
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.configuration.QuantileStrategy._

/**
 * Stores all the configuration options for tree construction.
 * @param algo classification or regression
 * @param impurity criterion used for information gain calculation
 * @param maxDepth maximum depth of the tree
 * @param maxBins maximum number of bins used for splitting features
 * @param quantileCalculationStrategy algorithm for calculating quantiles
 * @param categoricalFeaturesInfo a map storing information about the categorical variables and
 *                                the number of discrete values they take. For example, an entry
 *                                (n -> k) implies that feature n is categorical with k categories
 *                                0, 1, 2, ..., k-1. Note that features are zero-indexed.
 */
class Strategy (
    val algo: Algo,
    val impurity: Impurity,
    val maxDepth: Int,
    val maxBins: Int = 100,
    val quantileCalculationStrategy: QuantileStrategy = Sort,
    val categoricalFeaturesInfo: Map[Int, Int] = Map[Int, Int]()) extends Serializable

Review comment: Strategy should be renamed to Parameters. Modeling and algorithm parameters can be separate, the latter being part of the model.
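
As a small illustration of the categoricalFeaturesInfo convention documented above (the parameter values are arbitrary):

```scala
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.impurity.Entropy

// Feature 3 is categorical with 4 categories (values 0.0, 1.0, 2.0, 3.0);
// all other features are treated as continuous.
val strategy = new Strategy(
  algo = Classification,
  impurity = Entropy,
  maxDepth = 4,
  maxBins = 100,
  categoricalFeaturesInfo = Map(3 -> 4))
```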
47 changes: 47 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Entropy.scala
@@ -0,0 +1,47 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.impurity

/**
* Class for calculating [[http://en.wikipedia.org/wiki/Binary_entropy_function entropy]] during
* binary classification.
*/
object Entropy extends Impurity {

  def log2(x: Double) = scala.math.log(x) / scala.math.log(2)

  /**
   * entropy calculation
   * @param c0 count of instances with label 0
   * @param c1 count of instances with label 1
   * @return entropy value
   */
  def calculate(c0: Double, c1: Double): Double = {
    if (c0 == 0 || c1 == 0) {
      0
    } else {
      val total = c0 + c1
      val f0 = c0 / total
      val f1 = c1 / total
      -(f0 * log2(f0)) - (f1 * log2(f1))
    }
  }

  def calculate(count: Double, sum: Double, sumSquares: Double): Double =
    throw new UnsupportedOperationException("Entropy.calculate")
}
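
A quick worked example of the formula above (not part of the diff):

```scala
// c0 = 2, c1 = 6: f0 = 0.25, f1 = 0.75
// entropy = -(0.25 * log2(0.25)) - (0.75 * log2(0.75))
//         = 0.5 + 0.3113 ≈ 0.8113
val e = Entropy.calculate(2.0, 6.0)
```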
46 changes: 46 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala
@@ -0,0 +1,46 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.impurity

/**
* Class for calculating the
* [[http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity Gini impurity]]
* during binary classification.
*/
object Gini extends Impurity {

  /**
   * Gini coefficient calculation
   * @param c0 count of instances with label 0
   * @param c1 count of instances with label 1
   * @return Gini coefficient value
   */
  override def calculate(c0: Double, c1: Double): Double = {
    if (c0 == 0 || c1 == 0) {
      0
    } else {
      val total = c0 + c1
      val f0 = c0 / total
      val f1 = c1 / total
      1 - f0 * f0 - f1 * f1
    }
  }

  def calculate(count: Double, sum: Double, sumSquares: Double): Double =
    throw new UnsupportedOperationException("Gini.calculate")
}
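
And the corresponding worked example for Gini (not part of the diff):

```scala
// c0 = 2, c1 = 6: f0 = 0.25, f1 = 0.75
// gini = 1 - 0.25^2 - 0.75^2 = 1 - 0.0625 - 0.5625 = 0.375
val g = Gini.calculate(2.0, 6.0)
```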
42 changes: 42 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
@@ -0,0 +1,42 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.impurity

/**
* Trait for calculating information gain.
*/
trait Impurity extends Serializable {

Review comment: Impurity should be renamed to Error, or something more technical and familiar. Also see the earlier comments on the necessity and example design of a generic Error interface. The calculate method can likewise be renamed to something more descriptive, such as error.

For a generic interface, an additional ErrorStats trait and an error(errorStats: ErrorStats) method can be added. For example, Variance, or more aptly SquareError, would implement case class SquareErrorStats(count: Long, mean: Double, meanSquare: Double) and error(errorStats) = errorStats.meanSquare - errorStats.mean * errorStats.mean / count. Note that ErrorStats should have aggregation methods; the implementation for SquareErrorStats, for instance, is easy to see.

The Variance class should be renamed to SquareError, Entropy to EntropyError or KLDivergence, and Gini to GiniError.

  /**
   * information calculation for binary classification
   * @param c0 count of instances with label 0
   * @param c1 count of instances with label 1
   * @return information value
   */
  def calculate(c0: Double, c1: Double): Double

Review comment (Contributor): JavaDoc for public methods.

  /**
   * information calculation for regression
   * @param count number of instances
   * @param sum sum of labels
   * @param sumSquares sum of squares of the labels
   * @return information value
   */
  def calculate(count: Double, sum: Double, sumSquares: Double): Double

Review comment (Contributor): It is easy to lose precision or run into overflow when computing sumSquares. Is it only used for computing the sample variance? If so, we can simplify this interface to accept the variance directly. We have a more stable implementation of variance computation in DoubleRDDFunctions.


Review comment (Contributor, author): That's a nice observation. However, using the variance calculation in StatCounter might be slow, since the merge method recomputes n, mu, and m2 for each value. Also, it won't fit well with the binCombOp operation in the aggregate function. One could probably optimize the def merge(values: TraversableOnce[Double]): StatCounter method in the Variance class by doing a batch or mini-batch update, for both speed and precision, but that's a separate discussion.

I see your concern with computing sumSquares for a large fraction of the instances, and I think it's best to leverage the def merge(other: StatCounter): StatCounter method. We can calculate a StatCounter per partition using count, sum, and sumSquares, and then merge during binCombOp for numerical stability. It won't be hard to implement. Let me know what you think.


Review comment: I agree with Manish. Numerical stability is the first thing that comes to mind on seeing a large avg = sum / count calculation. In practice, I haven't seen any significant difference in results, or overflows, even with billion-sample datasets. Also, features in machine learning are typically normalized, and their dynamic range is small (bounded away from 0 and infinity).

We definitely cannot use the methods in DoubleRDDFunctions, because we want to calculate the variance of various splits, which requires the stats to be "aggregable". But we may be able to modify the APIs to use (count, avg, avgSquares) as the stats and make the calculations more stable. For example, to merge the (count, avg) of two parts (c1, a1) and (c2, a2), we would have (c1 + c2, a1 * (c1 / (c1 + c2)) + a2 * (c2 / (c1 + c2))). Not too keen on that change, but let me know if that works.
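
A sketch of the count-weighted merge being proposed here; the class and member names are illustrative, not from the PR:

```scala
// Aggregable regression stats kept per partition (or per bin):
// count, mean, and mean of squares, instead of raw sums.
case class VarStats(count: Long, avg: Double, avgSquares: Double) {

  // Merge two parts (c1, a1) and (c2, a2) with count weighting,
  // avoiding the overflow risk of carrying raw sums of squares.
  def merge(other: VarStats): VarStats = {
    val n = count + other.count
    val w1 = count.toDouble / n
    val w2 = other.count.toDouble / n
    VarStats(n, avg * w1 + other.avg * w2, avgSquares * w1 + other.avgSquares * w2)
  }

  // variance = E[X^2] - E[X]^2
  def variance: Double = avgSquares - avg * avg
}
```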


Review comment (Contributor): I agree that overflow is an issue here (particularly in the case of sumSquares), but I also agree with Manish and Hirakendu that this algorithm owes its ability to generate a tree in a reasonable amount of time to the property that we compute statistics for splits and then merge them together.

I actually do think it makes sense to maintain (count, average, averageSumSq) for each partition in a way that's overflow-friendly and to compute the combination as a count-weighted average of both, as Hirakendu suggests. This will complicate the code but should solve the overflow problem and keep things pretty efficient. That said, maybe this could be taken care of in a future PR as a bugfix, rather than in this one?


Review comment (Contributor): The major loss of precision comes from sumSquares - sum * sum / count, where one large number is subtracted from another. Changing the interface to (count, avg, avgSquare) would help avoid overflow, but it does nothing for precision. I agree with @etrain that we can improve it in a future PR.

The question is whether we should make calculate(c0, c1) and calculate(count, sum, sumSquares) public methods of Impurity. In either classification or regression, Impurity works like an accumulator. What we need to describe is how to process a label of type Double, how to merge two Impurity instances, and how to get the impurity from an instance, which is very similar to StatCounter. It is strange that Gini only implements the first method but not the second, while Variance only implements the second but not the first. We probably need to reconsider the design here. For example, if we want to handle three classes in the future, we will run into a signature collision with calculate(count, sum, squareSum).


Review comment (Contributor): I'm just catching up on this, but is the problem that there will be other types of Impurity later that calculate different stats (not just variance)? In that case, maybe we can have Impurity be parameterized (Impurity[T]), where T is the type it accumulates over. However, I'd also be okay with leaving this as is initially and marking the API unstable if this is an internal API. The question is how many users will call this directly.


Review comment (Contributor): BTW, I'd also be okay with updating this API in a later pull request before we release 1.0. It's fair game to change new APIs in that time window.


Review comment (Contributor): @mateiz A user needs an Impurity instance to construct a Strategy, but it is very unlikely they will need to call calculate directly or implement their own Impurity. I'm okay with marking the calculate method unstable in another PR later.


Review comment (Contributor, author): @mengxr The generic interface you describe is correct. However, I think implementing it and the corresponding implementations is not a minor code change. There are some assumptions in the bin aggregation code that may need to be updated, and it also requires adding partition-wise impurity calculation and aggregation.

@mateiz As @mengxr noted, it's highly unlikely that a user will write their own Impurity implementation. It's mostly an internal API and could be addressed soon in a different PR.

I think we all agree (please correct me if I am wrong) that the Impurity update belongs in a different PR. I can spend time on it immediately after this PR is accepted.

Is this the correct way of marking a method as unstable in the javadoc?
<span class="badge" style="float: right; background-color: darkblue;">ALPHA COMPONENT</span>


Review comment: Adding to the discussion on the need for a generic interface for Impurity, or more precisely Error, I believe we all see that it's good to have. Ideally I would have preferred a single Error trait, with all types of error, such as square error or KL divergence, extending it, but the consensus is that this negatively impacts performance.

In addition to performance-oriented implementations for specific loss functions, I would still recommend a generic Error interface and a generic decision-tree implementation based on it. One possibility is to add a third calculate(stats), or more precisely error(errorStats: ErrorStats), method to the Error interface. I am not sure it will help with the signature-collision problem, though, unless we keep just the one signature for generic error statistics.

For a reference example of one such interface and its implementations, see trait LossStats[S <: LossStats[S]] and abstract class Loss[S <: LossStats[S]:Manifest] in my previous PR, https://github.com/apache/incubator-spark/pull/161/files, which do exactly that and provide interfaces for aggregable error statistics and for calculating error from those statistics. (On second thought, I feel ErrorStats and Error are better names.) Also see the generic implementation class DecisionTreeAlgorithm[S <: LossStats[S]:Manifest] and the implementations of specific error functions, SquareLoss and EntropyLoss.


}
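
To make the accumulator-style redesign discussed above concrete, here is one possible shape for the generic interface. Everything below is hypothetical, modeled on the LossStats/Loss proposal, and not part of this PR:

```scala
// Hypothetical aggregable error-statistics interface: each error type
// defines its own statistics, how to merge them, and how to score them.
trait ErrorStats[S <: ErrorStats[S]] {
  def merge(other: S): S            // combine stats from two partitions or bins
}

trait Error[S <: ErrorStats[S]] extends Serializable {
  def statsFor(label: Double): S    // stats for a single labeled instance
  def error(stats: S): Double       // impurity/error from aggregated stats
}
```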
37 changes: 37 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Variance.scala
@@ -0,0 +1,37 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.impurity

/**
* Class for calculating variance during regression
*/
object Variance extends Impurity {

  override def calculate(c0: Double, c1: Double): Double =
    throw new UnsupportedOperationException("Variance.calculate")

  /**
   * variance calculation
   * @param count number of instances
   * @param sum sum of labels
   * @param sumSquares sum of squares of the labels
   */
  override def calculate(count: Double, sum: Double, sumSquares: Double): Double = {
    val squaredLoss = sumSquares - (sum * sum) / count
    squaredLoss / count
  }
}
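
A quick worked example of the formula (not part of the diff):

```scala
// Labels 1.0, 2.0, 3.0: count = 3, sum = 6, sumSquares = 14
// variance = (14 - 6 * 6 / 3) / 3 = (14 - 12) / 3 ≈ 0.667
val v = Variance.calculate(3.0, 6.0, 14.0)
```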
33 changes: 33 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/tree/model/Bin.scala
@@ -0,0 +1,33 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.tree.model

import org.apache.spark.mllib.tree.configuration.FeatureType._

/**
 * Used for "binning" the features for faster best-split calculation. For a continuous
 * feature, a bin is determined by a low and a high "split". For a categorical feature,
 * a bin is determined by a single label value (category).
 * @param lowSplit signifying the lower threshold for the continuous feature to be
 *                 accepted in the bin
 * @param highSplit signifying the upper threshold for the continuous feature to be
 *                  accepted in the bin
 * @param featureType type of feature -- categorical or continuous
 * @param category categorical label value accepted in the bin
 */
case class Bin(lowSplit: Split, highSplit: Split, featureType: FeatureType, category: Double)

Review comment: The Bin class can be simplified and some members renamed. The lowSplit and highSplit members can be reduced to a single threshold corresponding to the left end of the bin range, named leftEnd or lowEnd.

It's not clear this class is needed in the first place. For categorical variables, the value itself is the bin index, and for continuous variables, bins are simply defined by candidate thresholds, in turn defined by quantiles. For every feature id, one can maintain a list of categories and thresholds and be done. In that case, for continuous features, the position of the threshold is the bin index; see the sketch below.
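
A sketch of that scheme. The helper below is hypothetical, not from the PR (the PR itself uses binary search for continuous features):

```scala
// For a categorical feature, the value itself is the bin index.
// For a continuous feature, the bin index is the position of the value
// among the sorted candidate thresholds (linear scan for brevity).
def binIndex(value: Double, isCategorical: Boolean, thresholds: Array[Double]): Int =
  if (isCategorical) {
    value.toInt
  } else {
    val i = thresholds.indexWhere(value <= _)
    if (i == -1) thresholds.length else i   // beyond the highest threshold
  }
```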
