
[SPARK-12026] [MLlib] ChiSqTest gets slower and slower over time when number of features is large #10146


Closed
wants to merge 5 commits

Conversation

hhbyyh (Contributor) commented Dec 4, 2015

jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid: features.toArray.view.zipWithIndex.slice(startCol, endCol) gets slower as startCol grows.

I tested locally; the change improves performance and keeps the running time stable.
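
A minimal sketch (not from the PR) of how to see the effect locally: with the lazy views used here, iterating the sliced view still walks past the skipped prefix on every traversal, so the per-row cost grows with startCol even though only endCol - startCol elements are kept. The timings are illustrative only.

    // Illustrative only: time a sliced view at increasing offsets.
    // Iterating the sliced view drops the first startCol zipped elements
    // one by one before yielding anything.
    val arr = Array.fill(100000)(1.0)

    def timeSlice(startCol: Int): Long = {
      val t0 = System.nanoTime()
      arr.view.zipWithIndex.slice(startCol, startCol + 1000).foreach(_ => ())
      System.nanoTime() - t0
    }

    println(timeSlice(0))      // little to skip
    println(timeSlice(90000))  // 90000 zipped elements are created and discarded first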

SparkQA commented Dec 4, 2015

Test build #47199 has finished for PR 10146 at commit 1b1a0c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The diff under review:

    - features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
    + features.toArray.slice(startCol, endCol).zip(startCol until endCol).map {
        allDistinctFeatures(col) += feature
        (col, feature, label)
Contributor

slice is going to make a copy, and then zip will make another one. If the goal is to improve performance, I think it makes the most sense to iterate over the range you want directly:

val arr = features.toArray
(startCol until endCol).map { col =>
  val feature = arr(col)
  allDistinctFeatures(col) += feature
  (col, feature, label)
}

or even go a step further and pre-allocate the result array, to avoid all the copying from dynamically growing it.

also toArray is going to be horrible on sparse vectors -- it might make more sense to use toBreeze, which won't create so much wasted space (though still suboptimal looping). Better would be special handling. But that is independent from the main issue here.
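
A rough sketch (not part of this PR; names like activeInRange are illustrative) of the special-handling direction for sparse input: walk only the stored (index, value) pairs that fall inside [startCol, endCol) instead of densifying the whole vector. ChiSqTest would still need to account for the implicit zeros in that range, so this shows the idea rather than a drop-in replacement.

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    // Iterate the (column, value) pairs of a column range without building a dense array.
    // For a SparseVector only the explicitly stored entries are visited; implicit zeros
    // in the range are skipped and would need separate handling by the caller.
    def activeInRange(v: Vector, startCol: Int, endCol: Int): Iterator[(Int, Double)] = v match {
      case sv: SparseVector =>
        sv.indices.iterator.zip(sv.values.iterator)
          .dropWhile { case (col, _) => col < startCol }
          .takeWhile { case (col, _) => col < endCol }
      case dv: DenseVector =>
        Iterator.range(startCol, endCol).map(col => (col, dv.values(col)))
    }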

Contributor Author

Thanks @squito for the great suggestion. I'll wait a while to see if there are other comments.

SparkQA commented Dec 7, 2015

Test build #47263 has finished for PR 10146 at commit 8d8327d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

squito (Contributor) commented Dec 7, 2015

lgtm

The diff under review:

    - features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
    + val featureArray = features.toArray
    + (startCol until endCol).map { col =>
    +   val feature = featureArray(col)
Contributor

Ah yes, good catch about the view incrementally building a copy of the feature vectors.

Let's also remove featureArray; we can call val feature = features(col) directly.

Member

+1 since that could avoid a copy

Contributor Author

@thunterdb @jkbradley Thanks for the comments. Please correct me if I'm wrong: for sparse vectors, I'm afraid features(col) will not be efficient given the current implementation of SparseVector:

    def apply(i: Int): Double = toBreeze(i)

Member

Converting to Breeze should be an O(1) operation, using a reference copy, not a deep copy. Breeze uses binary search on the indices, so it should be fairly efficient. I think it's better than converting to a dense array.
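
For intuition, a minimal sketch (not Breeze's actual code) of why such indexed access stays cheap: the stored indices are kept sorted, so a lookup is a binary search over the active entries, with an implicit zero for everything else.

    import java.util.Arrays

    // Illustrative only: element i of a sparse vector stored as parallel (sorted)
    // index and value arrays; positions that are not stored are 0.0.
    def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
      val pos = Arrays.binarySearch(indices, i)
      if (pos >= 0) values(pos) else 0.0
    }

    // e.g. for the vector used in the benchmark below:
    // sparseApply(Array(1), Array(2.5), 1)     == 2.5
    // sparseApply(Array(1), Array(2.5), 99999) == 0.0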


thunterdb (Contributor)

@hhbyyh thanks for the fix; I just have one small comment.

hhbyyh (Contributor Author) commented Jan 12, 2016

@jkbradley I changed it to invoke toBreeze directly and ran some tests locally:

import org.apache.spark.mllib.linalg.SparseVector

val sv = new SparseVector(100000, Array(1), Array(2.5))
var t = 0.0

// 1. densify with toArray, then index into the plain array
{
  val ts = System.nanoTime()
  val arr = sv.toArray
  for (i <- 0 to 10000) {
    t = arr(i)
  }
  println(System.nanoTime() - ts)
}

// 2. index the SparseVector directly (each sv(i) goes through toBreeze)
{
  val ts = System.nanoTime()
  for (i <- 0 to 10000) {
    t = sv(i)
  }
  println(System.nanoTime() - ts)
}

// 3. convert to Breeze once, then index the Breeze vector
{
  val ts = System.nanoTime()
  val brz = sv.toBreeze
  for (i <- 0 to 10000) {
    t = brz(i)
  }
  println(System.nanoTime() - ts)
}

which gives (nanoseconds):

    1. toArray + array indexing:       3405342
    2. sv(i) via SparseVector.apply:   520367839
    3. toBreeze + Breeze indexing:     4870954

The third way should also be memory friendly, since it creates neither a dense array nor many Breeze objects per access. Let me know what you think.

hhbyyh (Contributor Author) commented Jan 12, 2016

Failed to fetch from https://github.com/apache/spark.git.
Need a retest.

jkbradley (Member)

@hhbyyh Thanks for doing that test. Let's go with option 3 as you suggested.

SparkQA commented Jan 12, 2016

Test build #2365 has finished for PR 10146 at commit a709f49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

thunterdb (Contributor)

@hhbyyh yes, option 3 sounds good.

One caveat about the numbers you posted: microbenchmarks on the JVM are very hard to get right, and a simple timing loop is not considered good practice in general. I recommend using a framework like JMH; there are some Scala wrappers for it.
http://openjdk.java.net/projects/code-tools/jmh/
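
For reference, a rough sketch of what a JMH version of the third case might look like, assuming the sbt-jmh plugin (or another Scala wrapper) and a package from which toBreeze is visible; the class and method names are illustrative, not from this PR.

    import org.openjdk.jmh.annotations._
    import org.apache.spark.mllib.linalg.SparseVector

    // JMH takes care of warm-up, forking and measurement iterations,
    // which a hand-rolled nanoTime loop does not.
    @State(Scope.Thread)
    class SparseAccessBenchmark {
      val sv = new SparseVector(100000, Array(1), Array(2.5))
      val brz = sv.toBreeze  // assumes access to the package-private toBreeze

      @Benchmark
      def accessViaBreeze(): Double = {
        var t = 0.0
        var i = 0
        while (i <= 10000) { t = brz(i); i += 1 }
        t
      }
    }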

jkbradley (Member)

LGTM
Merging with master and branch-1.6
Thanks for the fix!

asfgit pushed a commit that referenced this pull request Jan 14, 2016
…number of features is large

jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.

I tested on local and the change can improve the performance and the running time was stable.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.

(cherry picked from commit 021dafc)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit closed this in 021dafc Jan 14, 2016
hhbyyh (Contributor Author) commented Jan 14, 2016

@thunterdb Thanks a lot for the recommendation. I'll try it.
