[SPARK-12026] [MLlib] ChiSqTest gets slower and slower over time when number of features is large #10146
Conversation
Test build #47199 has finished for PR 10146 at commit
-      features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
+      features.toArray.slice(startCol, endCol).zip(startCol until endCol).map {
         allDistinctFeatures(col) += feature
         (col, feature, label)
`slice` is going to make a copy, and then `zip` will make another one. If the goal is to improve performance, I think it makes the most sense to iterate over the range you want directly:
val arr = features.toArray
(startCol until endCol).map { col =>
  val feature = arr(col)
  allDistinctFeatures(col) += feature
  (col, feature, label)
}
Or even go a step further and pre-allocate the result array, to avoid all the copying from dynamically growing it, for example:
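A minimal sketch of the pre-allocation idea (variable names are illustrative; `label` is assumed to be the `Double` label of the surrounding `LabeledPoint`, as in `ChiSqTest`):

val arr = features.toArray
// Pre-size the output so it is never grown or copied while being filled.
val result = new Array[(Int, Double, Double)](endCol - startCol)
var i = 0
while (i < result.length) {
  val col = startCol + i
  val feature = arr(col)
  allDistinctFeatures(col) += feature
  result(i) = (col, feature, label)
  i += 1
}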
Also, `toArray` is going to be horrible on sparse vectors -- it might make more sense to use `toBreeze`, which won't create so much wasted space (though still suboptimal looping). Better would be special handling, but that is independent from the main issue here.
Thanks @squito for the great suggestion. I'll wait a while to see if there are other comments.
Test build #47263 has finished for PR 10146 at commit
lgtm
-      features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
+      val featureArray = features.toArray
+      (startCol until endCol).map { col =>
+        val feature = featureArray(col)
Ah yes, good catch about the view that incrementally builds a copy of the feature vector. Let's also remove `featureArray`; we can call `val feature = features(col)` directly.
+1 since that could avoid a copy
@thunterdb @jkbradley Thanks for the comments. Please correct me if I'm wrong: for a sparse vector, I'm afraid `features(col)` will not be efficient given the current implementation of `SparseVector`, which goes through `toBreeze` on every call:
def apply(i: Int): Double = toBreeze(i)
Converting to Breeze should be an O(1) operation, using a reference copy, not a deep copy. Breeze uses binary search on the indices, so it should be fairly efficient. I think it's better than converting to a dense array.
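For concreteness, a sketch of the convert-once pattern being discussed (note that `toBreeze` is package-private in Spark, so this shape only compiles inside Spark's own source tree):

val brz = features.toBreeze            // O(1): wraps the existing indices/values arrays by reference
(startCol until endCol).map { col =>
  val feature = brz(col)               // for a SparseVector, Breeze binary-searches the indices
  allDistinctFeatures(col) += feature
  (col, feature, label)
}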
@jkbradley I changed it to invoke `toBreeze` directly. I ran some tests locally:
import org.apache.spark.mllib.linalg.SparseVector

val sv = new SparseVector(100000, Array(1), Array(2.5))
var t = 0.0

// Option 1: convert to a dense array once, then index into it.
{
  val ts = System.nanoTime()
  val arr = sv.toArray
  for (i <- 0 to 10000) {
    t = arr(i)
  }
  println(System.nanoTime() - ts)
}

// Option 2: index the vector directly; each sv(i) goes through toBreeze.
{
  val ts = System.nanoTime()
  for (i <- 0 to 10000) {
    t = sv(i)
  }
  println(System.nanoTime() - ts)
}

// Option 3: convert to a Breeze vector once, then index into it.
{
  val ts = System.nanoTime()
  val brz = sv.toBreeze
  for (i <- 0 to 10000) {
    t = brz(i)
  }
  println(System.nanoTime() - ts)
}
gives (elapsed nanoseconds):

3405342     (option 1: dense array once)
520367839   (option 2: sv(i) per element)
4870954     (option 3: Breeze vector once)
The third way should be memory-friendly, as it does not create a dense array or many Breeze objects. Let me know what you think.
@hhbyyh thanks for the fix; I just have one small comment.
Failed to fetch from https://github.com/apache/spark.git.
@hhbyyh Thanks for doing that test. Let's go with option 3 as you suggested.
Test build #2365 has finished for PR 10146 at commit
@hhbyyh yes, option 3 sounds good. A caveat, though, about the numbers you posted: micro-benchmarks on the JVM are very hard to get right, and a simple timing loop is not considered good practice in general. I recommend using a framework like JMH; there are some Scala wrappers for it.
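For reference, a rough sketch of the same comparison under JMH (assuming the sbt-jmh plugin; the class and method names are illustrative, and the toBreeze variant is omitted because that method is package-private):

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.apache.spark.mllib.linalg.SparseVector

@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
class SparseVectorAccessBench {
  val sv = new SparseVector(100000, Array(1), Array(2.5))

  // Option 1: one dense copy up front, then plain array indexing.
  @Benchmark
  def denseArrayOnce(): Double = {
    val arr = sv.toArray
    var t = 0.0
    var i = 0
    while (i < 10000) { t = arr(i); i += 1 }
    t  // return the result so JMH does not dead-code-eliminate the loop
  }

  // Option 2: sv(i) per element, which converts through Breeze on every call.
  @Benchmark
  def applyPerElement(): Double = {
    var t = 0.0
    var i = 0
    while (i < 10000) { t = sv(i); i += 1 }
    t
  }
}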
LGTM
[SPARK-12026] [MLlib] ChiSqTest gets slower and slower over time when number of features is large

jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid, as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger. I tested locally; the change improves performance and the running time is stable.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.

(cherry picked from commit 021dafc)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@thunterdb Thanks a lot for the recommendation. I'll try it.