-
Notifications
You must be signed in to change notification settings - Fork 28.6k
SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api #423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can one of the admins verify this patch? |
@Test | ||
public void zipWithUniqueId() { | ||
List<Integer> correct = Arrays.asList(1, 2, 3, 4); | ||
JavaPairRDD<Integer, Long> zip = sc.parallelize(correct).zipWithUniqueId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should test with more than one partitions.
List<Integer> dataArray = Arrays.asList(1, 2, 3, 4); | ||
JavaPairRDD<Integer, Long> zip = sc.parallelize(dataArray).zipWithIndex(); | ||
JavaRDD<Long> indexes = zip.values(); | ||
HashSet<Long> correctIndexes = new HashSet<Long>(Arrays.asList(0l, 1l, 2l, 3l)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use a list or an Array instead of a set here, because you want to assert on the exact order.
Also, use L
instead of l
.
Jenkins, test this please. |
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
* 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method | ||
* won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]]. | ||
*/ | ||
def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saw this. Why do you need [Long]
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When remove the [Long]
. The type of return value is JavaPairRDD<T,Object>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def zipWithUniqueId(): JavaPairRDD[T, Long]
would return JavaPairRDD<T, Object>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes,in my test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mengxr already found this out - but the reason is you'd want to declare the type as java.lang.Double instead of Long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically what you created here is a type parameter named "Long" (surprisingly not a keyword in Scala), and you got the compiler to infer the type when you were calling it from Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try:
def zipWithUniqueId(): JavaPairRDD[T, java.lang.Long] = {
JavaPairRDD.fromRDD(rdd.zipWithUniqueId().map(x => (x._1, new java.lang.Long(x._2))))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def zipWithUniqueId(): JavaPairRDD[T, JLong] = {
JavaPairRDD.fromRDD(rdd.zipWithUniqueId()).asInstanceOf[JavaPairRDD[T, JLong]]
}
is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just put java.lang.Long. It is not that "long" anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if Jenkins is happy. |
Jenkins, test this please. |
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
Thanks. I've merged this. |
Author: witgo <witgo@qq.com> Closes #423 from witgo/zipWithIndex and squashes the following commits: 039ec04 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex 24d74c9 [witgo] review commit 763a5e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex 59747d1 [witgo] review commit 7bf4d06 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex daa8f84 [witgo] review commit 4070613 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex 18e6c97 [witgo] java api zipWithIndex test 11e2e7f [witgo] add zipWithIndex zipWithUniqueId methods to java api (cherry picked from commit 7d15058) Signed-off-by: Reynold Xin <rxin@apache.org>
Improving the graphx-programming-guide This PR will track a few minor improvements to the content and formatting of the graphx-programming-guide.
Author: witgo <witgo@qq.com> Closes apache#423 from witgo/zipWithIndex and squashes the following commits: 039ec04 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex 24d74c9 [witgo] review commit 763a5e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex 59747d1 [witgo] review commit 7bf4d06 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex daa8f84 [witgo] review commit 4070613 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex 18e6c97 [witgo] java api zipWithIndex test 11e2e7f [witgo] add zipWithIndex zipWithUniqueId methods to java api
Improving the graphx-programming-guide This PR will track a few minor improvements to the content and formatting of the graphx-programming-guide. (cherry picked from commit 3fcc68b) Signed-off-by: Reynold Xin <rxin@apache.org>
…lish-fix Ensure bintray upload happens before repository is no clean.
No description provided.