[SPARK-8998][MLlib] Collect enough frequent prefixes before local processing in PrefixSpan (new) #7412
Conversation
Use PrefixSpan.scala instead of Prefixspan.scala; delete Prefixspan.scala.
Use PrefixSpanSuite.scala instead of PrefixspanSuite.scala; delete PrefixspanSuite.scala.
Initialize local master branch.
@mengxr This is a new PR, please review it. Thanks.
Test build #37309 has finished for PR 7412 at commit
allPatterns
var patternsCount = lengthOnePatternsAndCounts.length
var allPatternAndCounts = sequences.sparkContext.parallelize(
No need to `parallelize` if you remove the `collect` on L88 (will still need `collect` on L90 for now).
OK
getPatternCountsAndPrefixSuffixPairs(minCount, largePrefixSuffixPairs)
patternsCount = nextPatternAndCounts.count()
largePrefixSuffixPairs.unpersist()
val splitedPrefixSuffixPairs = splitPrefixSuffixPairs(nextPrefixSuffixPairs)
nit: split**t**ed
Actually, instead of `._1` and `._2` below, why not just assign like in L94 here as well?
OK
Made some suggestions; see how perf changes after them. Unfortunately, scanning the dataset to ensure suffixes are bounded will introduce a performance hit. I still think it's worth it, though, since it's certainly better than just failing. It may be worthwhile to test that these changes prevent executor failure due to overload. One way to do that would be to use a large enough dataset and set
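The split being discussed can be illustrated with a small, self-contained sketch. This is plain Python rather than the PR's Scala, and the names, pair layout, and threshold here are hypothetical stand-ins for the PR's `splitPrefixSuffixPairs` logic: prefixes whose projected database fits under the size bound go to local processing, and the rest into another distributed PrefixSpan iteration.

```python
# Illustrative sketch (not the PR's Scala) of splitting prefix ->
# projected-database pairs by size. A plain dict stands in for the
# grouped RDD data; all names here are hypothetical.

MAX_LOCAL_PROJ_DB_SIZE = 10  # stand-in for maxLocalProjDBSize (PR uses 10000)

def split_prefix_suffix_pairs(projected):
    """Partition prefixes into 'small' (mine locally) and 'large'
    (needs another distributed PrefixSpan pass)."""
    small, large = {}, {}
    for prefix, suffixes in projected.items():
        # total number of items in this prefix's projected database
        size = sum(len(s) for s in suffixes)
        (small if size <= MAX_LOCAL_PROJ_DB_SIZE else large)[prefix] = suffixes
    return small, large

projected = {
    (1,): [[2, 3], [2], [3]],                       # 4 items  -> small
    (2,): [[1, 3, 4], [3, 4, 5], [1, 2, 3, 4, 5]],  # 11 items -> large
}
small, large = split_prefix_suffix_pairs(projected)
```

In the PR this partition is what keeps any single executor's local workload bounded, at the cost of the extra size scan discussed above.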
@feynmanliang Regarding splitPrefixSuffixPairs: I compared the two methods. I found that your method's running time is longer than mine, and its result is not correct. I don't know why, please check it, thank you.
@feynmanliang You are right, it is worth preventing executor failure; I very much agree. I tested it according to your suggestion.
Test build #38502 has finished for PR 7412 at commit
[Spark-8998]Collect Enough Prefixes Improvements
Test build #38791 has finished for PR 7412 at commit
 * The maximum number of items allowed in a projected database before local processing. If a
 * projected database exceeds this size, another iteration of distributed PrefixSpan is run.
 */
private val maxLocalProjDBSize: Long = 10000
Please leave a TODO to make it configurable with a better default value. `10000` may be too small.
OK
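What "local processing" means once a projected database is under `maxLocalProjDBSize` can be sketched with a plain recursive PrefixSpan run on one machine. This is illustrative Python, not the PR's Scala, and the function name is hypothetical; sequences are lists of single items, which matches this PR's setting.

```python
# Hypothetical sketch of the local mining step: once a projected database
# is small enough, extend the prefix recursively on a single machine.

def local_prefixspan(prefix, projected, min_count, results):
    """Recursively extend `prefix` with every item frequent in `projected`."""
    counts = {}
    for suffix in projected:
        for item in set(suffix):          # count each item once per sequence
            counts[item] = counts.get(item, 0) + 1
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue
        new_prefix = prefix + [item]
        results.append((tuple(new_prefix), count))
        # project: keep only what follows the first occurrence of `item`
        new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
        local_prefixspan(new_prefix, new_projected, min_count, results)

results = []
local_prefixspan([], [[1, 2, 3], [1, 3], [2, 3]], min_count=2, results=results)
# frequent patterns found: (1,), (1, 3), (2,), (2, 3), (3,)
```

The recursion depth and fan-out are bounded by the projected database's size, which is why capping that size before handing work to a single executor matters.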
@zhangjiajin @feynmanliang I made some minor comments. We can address the default value of `maxLocalProjDBSize`.
Collect enough frequent prefixes before projection in PrefixSpan
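The title summarizes the control flow: keep extending prefixes in distributed passes while any projected database is still large, then finish the small ones locally. A compact, self-contained Python rendering of that loop follows; plain lists stand in for the Spark RDD operations, and all names, the threshold, and the minimum count are illustrative assumptions, not the PR's API.

```python
# Hypothetical end-to-end sketch of "collect enough frequent prefixes
# before projection": large projected databases stay in the distributed
# phase, small ones are finished locally. Not the PR's Scala.

MAX_LOCAL_PROJ_DB_SIZE = 4
MIN_COUNT = 2

def frequent_items(projected):
    """Items frequent across the suffixes of one projected database."""
    counts = {}
    for s in projected:
        for item in set(s):
            counts[item] = counts.get(item, 0) + 1
    return {i: c for i, c in counts.items() if c >= MIN_COUNT}

def project(projected, item):
    suffixes = (s[s.index(item) + 1:] for s in projected if item in s)
    return [s for s in suffixes if s]     # drop empty suffixes

def prefixspan(sequences):
    patterns = []
    large = [((), sequences)]             # would live in an RDD in Spark
    local = []                            # small enough for one machine
    # distributed phase: in Spark each pass handles all large prefixes at
    # once; this sketch just pops them one at a time
    while large:
        prefix, projected = large.pop()
        for item, count in sorted(frequent_items(projected).items()):
            new_prefix = prefix + (item,)
            patterns.append((new_prefix, count))
            sub = project(projected, item)
            if sum(len(s) for s in sub) > MAX_LOCAL_PROJ_DB_SIZE:
                large.append((new_prefix, sub))
            elif sub:
                local.append((new_prefix, sub))
    # local phase: same expansion, but without further distributed passes
    while local:
        prefix, projected = local.pop()
        for item, count in sorted(frequent_items(projected).items()):
            new_prefix = prefix + (item,)
            patterns.append((new_prefix, count))
            sub = project(projected, item)
            if sub:
                local.append((new_prefix, sub))
    return patterns
```

With the small threshold chosen here, the initial database goes through one distributed-style pass and everything after that finishes in the local phase, mirroring the iteration the commit messages describe.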