
SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api #423


Closed

witgo wants to merge 9 commits from witgo/zipWithIndex

Conversation

@witgo (Contributor) commented Apr 16, 2014

No description provided.

@AmplabJenkins

Can one of the admins verify this patch?

@witgo witgo changed the title SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api [WIP]SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api Apr 16, 2014
@witgo witgo changed the title [WIP]SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api SPARK-1509: add zipWithIndex zipWithUniqueId methods to java api Apr 17, 2014
@Test
public void zipWithUniqueId() {
List<Integer> correct = Arrays.asList(1, 2, 3, 4);
JavaPairRDD<Integer, Long> zip = sc.parallelize(correct).zipWithUniqueId();
Contributor: Should test with more than one partition.
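The reason multiple partitions matter can be sketched in plain Scala (no Spark; the partition layout below is invented for illustration). Per the doc comment in this PR, zipWithUniqueId gives the i-th item of partition k (out of n partitions) the id i*n + k, so the ids only diverge from plain zipWithIndex once n > 1:

```scala
// Plain-Scala sketch (no Spark) of the zipWithUniqueId id scheme:
// the i-th item of partition k (out of n partitions) gets id i*n + k.
// The partition layout here is made up for illustration.
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))
val n = partitions.size
val withIds = partitions.zipWithIndex.flatMap { case (part, k) =>
  part.zipWithIndex.map { case (item, i) => (item, i.toLong * n + k) }
}
// ids are unique but contain gaps (here: 0, 3, 1, 2, 5, 8)
```

With a single partition the ids come out as 0, 1, 2, ..., indistinguishable from zipWithIndex, which is why a one-partition test is too weak.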

List<Integer> dataArray = Arrays.asList(1, 2, 3, 4);
JavaPairRDD<Integer, Long> zip = sc.parallelize(dataArray).zipWithIndex();
JavaRDD<Long> indexes = zip.values();
HashSet<Long> correctIndexes = new HashSet<Long>(Arrays.asList(0l, 1l, 2l, 3l));
Contributor: You should use a List or an Array instead of a Set here, because you want to assert on the exact order.

Also, use L instead of l for long literals.
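Why a Set is too weak here can be shown with a plain-Scala sketch (no Spark): set equality only checks membership, so it would also pass if the indices came back in the wrong order.

```scala
// Plain-Scala sketch (no Spark): a HashSet-style comparison checks
// membership only, so it cannot distinguish correctly ordered indices
// from scrambled ones; an ordered List comparison can.
val data = List(1, 2, 3, 4)
val indexes = data.zipWithIndex.map { case (_, i) => i.toLong }
val scrambled = indexes.reverse

// scrambled.toSet == indexes.toSet  (set comparison passes either way)
// scrambled != indexes              (only the ordered comparison fails)
```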

@mengxr (Contributor) commented Apr 24, 2014

Jenkins, test this please.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14428/

* 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
* won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
*/
def zipWithUniqueId[Long](): JavaPairRDD[T, Long] = {
Contributor: Just saw this. Why do you need [Long] here?

Contributor Author: When the [Long] is removed, the return type seen from Java becomes JavaPairRDD<T, Object>.

Contributor:

def zipWithUniqueId(): JavaPairRDD[T, Long]

would return JavaPairRDD<T, Object>?

Contributor Author: Yes, in my test it does.

Contributor: @mengxr already found this out, but the reason is that you'd want to declare the type as java.lang.Long instead of Scala's Long.

Contributor: Basically, what you created here is a type parameter named "Long" (surprisingly, not a keyword in Scala), and you got the compiler to infer the type when you were calling it from Java.
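The shadowing can be reproduced in a few lines of plain Scala (the method name below is made up for illustration):

```scala
// A type parameter may be named "Long"; inside this method it shadows
// scala.Long and behaves as an ordinary type variable.
def second[Long](pair: (String, Long)): Long = pair._2

// The compiler is free to infer Long = String here:
val inferred: String = second(("key", "value"))
```

This is exactly why the declaration `zipWithUniqueId[Long]()` compiled: the `Long` in the signature was a fresh type variable, not the 64-bit integer type.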

Contributor: Try:
  def zipWithUniqueId(): JavaPairRDD[T, java.lang.Long] = {
    JavaPairRDD.fromRDD(rdd.zipWithUniqueId().map(x => (x._1, new java.lang.Long(x._2))))
  }

Contributor Author:
  def zipWithUniqueId(): JavaPairRDD[T, JLong] = {
    JavaPairRDD.fromRDD(rdd.zipWithUniqueId()).asInstanceOf[JavaPairRDD[T, JLong]]
  } 

Is this better?

Contributor: Let's just put java.lang.Long. It is not that "long" anyway.

Contributor Author: @rxin You're right; it has been modified.

@mengxr, this version:

  def zipWithUniqueId(): JavaPairRDD[T, java.lang.Long] = {
    JavaPairRDD.fromRDD(rdd.zipWithUniqueId().map(x => (x._1, new java.lang.Long(x._2))))
  }

creates too many objects.
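Why the asInstanceOf cast avoids those allocations can be sketched in plain Scala (no Spark): due to erasure, a generic container already stores its scala.Long components boxed as java.lang.Long, so casting the container type reuses the existing boxes instead of constructing a new java.lang.Long per element.

```scala
// Plain-Scala sketch (no Spark): a generic tuple stores its Long
// component boxed as java.lang.Long, so casting the container type
// is safe at runtime and allocates nothing per element.
val pairs: Seq[(String, Long)] = Seq(("a", 1L), ("b", 2L))
val cast = pairs.asInstanceOf[Seq[(String, java.lang.Long)]]
// cast.head._2 is the very same boxed object the Seq already held
```

This is the trade-off the thread settles on: the map-based version is more obviously type-correct, while the cast avoids a per-element allocation.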

@mengxr (Contributor) commented Apr 29, 2014

LGTM if Jenkins is happy.

@mengxr (Contributor) commented Apr 29, 2014

Jenkins, test this please.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14561/

@rxin (Contributor) commented Apr 29, 2014

Thanks. I've merged this.

asfgit pushed a commit that referenced this pull request Apr 29, 2014
Author: witgo <witgo@qq.com>

Closes #423 from witgo/zipWithIndex and squashes the following commits:

039ec04 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex
24d74c9 [witgo] review commit
763a5e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex
59747d1 [witgo] review commit
7bf4d06 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex
daa8f84 [witgo] review commit
4070613 [witgo] Merge branch 'master' of https://github.com/apache/spark into zipWithIndex
18e6c97 [witgo] java api zipWithIndex test
11e2e7f [witgo] add zipWithIndex zipWithUniqueId methods to java api

(cherry picked from commit 7d15058)
Signed-off-by: Reynold Xin <rxin@apache.org>
@asfgit asfgit closed this in 7d15058 Apr 29, 2014
@witgo witgo deleted the zipWithIndex branch April 30, 2014 01:37
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
Improving the graphx-programming-guide

This PR will track a few minor improvements to the content and formatting of the graphx-programming-guide.
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
(same squashed commits as listed above)
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
(same graphx-programming-guide note as above)

(cherry picked from commit 3fcc68b)
Signed-off-by: Reynold Xin <rxin@apache.org>
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017
mccheah added a commit to mccheah/spark that referenced this pull request Nov 28, 2018
…lish-fix

Ensure bintray upload happens before repository is no clean.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
4 participants