Commit a490787

hhbyyh authored and jkbradley committed
[SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large
jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid: features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol grows. I tested locally; the change improves performance and keeps the running time stable.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.

(cherry picked from commit 021dafc)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
1 parent: 26f13fa
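For intuition, here is a rough, hypothetical micro-benchmark (not part of the patch; object name, array size, and window width are made up) that reproduces the slowdown the commit message describes: slicing a zipped view still walks the underlying sequence from index 0, so each slice costs roughly O(startCol) rather than O(width).

```scala
// Hypothetical timing sketch; illustrates why the sliced view gets slower
// as startCol grows. None of these names come from the Spark source.
object SliceTiming {
  def main(args: Array[String]): Unit = {
    val features = Array.fill(2000000)(1.0) // stand-in for a wide feature vector
    val width = 1000
    for (startCol <- Seq(0, 500000, 1000000, 1500000)) {
      val t0 = System.nanoTime()
      // The access pattern removed by this commit:
      features.view.zipWithIndex.slice(startCol, startCol + width).foreach(_ => ())
      println(f"startCol=$startCol%8d  ${(System.nanoTime() - t0) / 1e6}%8.2f ms")
    }
  }
}
```

The patched code instead indexes directly into the Breeze vector, so each row's work is proportional to endCol - startCol regardless of where the window starts.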

File tree

1 file changed: +4 -2 lines changed

mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala

Lines changed: 4 additions & 2 deletions
@@ -109,7 +109,9 @@ private[stat] object ChiSqTest extends Logging {
       }
       i += 1
       distinctLabels += label
-      features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
+      val brzFeatures = features.toBreeze
+      (startCol until endCol).map { col =>
+        val feature = brzFeatures(col)
         allDistinctFeatures(col) += feature
         (col, feature, label)
       }
@@ -122,7 +124,7 @@ private[stat] object ChiSqTest extends Logging {
        pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct.zipWithIndex.toMap
      }
    val numLabels = labels.size
-    pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
+    pairCounts.keys.groupBy(_._1).foreach { case (col, keys) =>
      val features = keys.map(_._2).toArray.distinct.zipWithIndex.toMap
      val numRows = features.size
      val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
