[SPARK-12026] [MLlib] ChiSqTest gets slower and slower over time when number of features is large #10146
Conversation
Test build #47199 has finished for PR 10146 at commit
-      features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
+      features.toArray.slice(startCol, endCol).zip(startCol until endCol).map {
         allDistinctFeatures(col) += feature
         (col, feature, label)
`slice` is going to make a copy, and then `zip` will make another one. If the goal is to improve performance, I think it makes the most sense to iterate over the range you want directly:
val arr = features.toArray
(startCol until endCol).map { col =>
  val feature = arr(col)
  allDistinctFeatures(col) += feature
  (col, feature, label)
}
Or even go a step further and pre-allocate the result array, to avoid all the copying from dynamically growing it, for example:
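A minimal sketch of the pre-allocation idea (variable names are illustrative; `label` is assumed to be the `Double` label of the surrounding `LabeledPoint`, as in `ChiSqTest`):

val arr = features.toArray
// Pre-size the output so it is never grown or copied while being filled.
val result = new Array[(Int, Double, Double)](endCol - startCol)
var i = 0
while (i < result.length) {
  val col = startCol + i
  val feature = arr(col)
  allDistinctFeatures(col) += feature
  result(i) = (col, feature, label)
  i += 1
}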
Also, `toArray` is going to be horrible on sparse vectors -- it might make more sense to use `toBreeze`, which won't create so much wasted space (though still suboptimal looping). Better would be special handling, but that is independent from the main issue here.
Thanks @squito for the great suggestion. I'll wait a while to see if there are other comments.
Test build #47263 has finished for PR 10146 at commit
lgtm
-      features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
+      val featureArray = features.toArray
+      (startCol until endCol).map { col =>
+        val feature = featureArray(col)
Ah yes, good catch about the view that incrementally builds a copy of the feature vector. Let's also remove `featureArray`; we can call `val feature = features(col)` directly.
+1 since that could avoid a copy
@thunterdb @jkbradley Thanks for the comments. Please correct me if I'm wrong: for a sparse vector, I'm afraid `features(col)` will not be efficient given the current implementation of `SparseVector`, which goes through `toBreeze` on every call:
def apply(i: Int): Double = toBreeze(i)
Converting to Breeze should be an O(1) operation, using a reference copy, not a deep copy. Breeze uses binary search on the indices, so it should be fairly efficient. I think it's better than converting to a dense array.
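For concreteness, a sketch of the convert-once pattern being discussed (note that `toBreeze` is package-private in Spark, so this shape only compiles inside Spark's own source tree):

val brz = features.toBreeze            // O(1): wraps the existing indices/values arrays by reference
(startCol until endCol).map { col =>
  val feature = brz(col)               // for a SparseVector, Breeze binary-searches the indices
  allDistinctFeatures(col) += feature
  (col, feature, label)
}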
@jkbradley I changed it to invoke `toBreeze` directly. I ran some tests locally:
import org.apache.spark.mllib.linalg.SparseVector

val sv = new SparseVector(100000, Array(1), Array(2.5))
var t = 0.0

// Option 1: convert to a dense array once, then index into it.
{
  val ts = System.nanoTime()
  val arr = sv.toArray
  for (i <- 0 to 10000) {
    t = arr(i)
  }
  println(System.nanoTime() - ts)
}

// Option 2: index the vector directly; each sv(i) goes through toBreeze.
{
  val ts = System.nanoTime()
  for (i <- 0 to 10000) {
    t = sv(i)
  }
  println(System.nanoTime() - ts)
}

// Option 3: convert to a Breeze vector once, then index into it.
{
  val ts = System.nanoTime()
  val brz = sv.toBreeze
  for (i <- 0 to 10000) {
    t = brz(i)
  }
  println(System.nanoTime() - ts)
}
gives (elapsed nanoseconds):

3405342     (option 1: dense array once)
520367839   (option 2: sv(i) per element)
4870954     (option 3: Breeze vector once)
The third way should be memory-friendly, as it does not create a dense array or many Breeze objects. Let me know what you think.
@hhbyyh thanks for the fix; I just have one small comment.
Failed to fetch from https://github.com/apache/spark.git.
@hhbyyh Thanks for doing that test. Let's go with option 3 as you suggested.
Test build #2365 has finished for PR 10146 at commit
@hhbyyh yes, option 3 sounds good. A caveat, though, about the numbers you posted: micro-benchmarks on the JVM are very hard to get right, and a simple timing loop is not considered good practice in general. I recommend using a framework like JMH; there are some Scala wrappers for it.
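For reference, a rough sketch of the same comparison under JMH (assuming the sbt-jmh plugin; the class and method names are illustrative, and the toBreeze variant is omitted because that method is package-private):

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.apache.spark.mllib.linalg.SparseVector

@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
class SparseVectorAccessBench {
  val sv = new SparseVector(100000, Array(1), Array(2.5))

  // Option 1: one dense copy up front, then plain array indexing.
  @Benchmark
  def denseArrayOnce(): Double = {
    val arr = sv.toArray
    var t = 0.0
    var i = 0
    while (i < 10000) { t = arr(i); i += 1 }
    t  // return the result so JMH does not dead-code-eliminate the loop
  }

  // Option 2: sv(i) per element, which converts through Breeze on every call.
  @Benchmark
  def applyPerElement(): Double = {
    var t = 0.0
    var i = 0
    while (i < 10000) { t = sv(i); i += 1 }
    t
  }
}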
LGTM
[SPARK-12026] [MLlib] ChiSqTest gets slower and slower over time when number of features is large

jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid, as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger. I tested locally; the change improves performance and the running time is stable.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.

(cherry picked from commit 021dafc)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@thunterdb Thanks a lot for the recommendation. I'll try it.