
[SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize. #13890


Closed
wants to merge 4 commits

Conversation

ueshin
Member

@ueshin ueshin commented Jun 24, 2016

What changes were proposed in this pull request?

Currently the input RDD of a Dataset is always serialized to an RDD[InternalRow] before being used as a Dataset, but there are cases where we call map or mapPartitions right after converting to a Dataset.
In those cases a serialize followed by a deserialize happens even though it is not needed.

This PR adds an ExistingRDD logical plan for RDD inputs to give the planner a chance to eliminate the serialize/deserialize.
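
For context, here is a minimal sketch of the pattern this targets (a hypothetical example, not code from this PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

// Converting the RDD to a Dataset plans a serialization of each element to
// InternalRow; the map right afterwards deserializes it back into an Int.
// A dedicated logical plan for the input RDD gives the planner a chance to
// elide that serialize/deserialize round trip.
val ds = spark.createDataset(rdd).map(_ + 1)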

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61169 has finished for PR 13890 at commit 036d5fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- because the Encoder for Java Beans needs not only getters but also setters.
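
For illustration only (a hypothetical bean, not code from this PR): Encoders.bean resolves properties via JavaBean conventions, so a field is only encodable when it has both a getter and a setter.

import org.apache.spark.sql.Encoders

// Hypothetical bean; the name property needs both accessors to be picked up.
class Person {
  private var name: String = _
  def getName: String = name                           // a getter alone is not enough
  def setName(value: String): Unit = { name = value }  // the setter is also required
}

val personEncoder = Encoders.bean(classOf[Person])
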
@SparkQA

SparkQA commented Jun 24, 2016

Test build #61176 has finished for PR 13890 at commit 62aca29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented Jul 4, 2016

@marmbrus Could you review this PR or assign someone to review it, please?

@marmbrus
Contributor

marmbrus commented Jul 5, 2016

/cc @cloud-fan

}

/** Physical plan node for scanning data from an RDD. */
private[sql] case class ExistingRDDScanExec[T](
Contributor

From the name it's hard to tell what the difference is between this one and RDDScanExec...

Member Author

How about renaming RDDScanExec to LogicalRDDScanExec?

Contributor

I don't have a good idea here, cc @yhuai

Contributor

LogicalRDDScanExec sounds weird because Logical means that it is a logical node, but Exec means that it is a physical node.

Contributor

how about ExternalRDDScan?

@cloud-fan
Contributor

it's a pretty good optimization! should we also apply it to LocalRelation?

@ueshin
Member Author

ueshin commented Jul 6, 2016

Hmm, I think we can't apply it to LocalRelation because the data in a LocalRelation might not be Java- or Kryo-serializable. We need the data to be serializable to create an RDD.
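
A rough sketch of the distinction (a hypothetical example, assuming the usual SparkSession API):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// createDataset(Seq) encodes the local data to InternalRows eagerly on the
// driver and wraps them in a LocalRelation, so the original objects never
// have to be Java/Kryo serializable.
val fromSeq = spark.createDataset(Seq(1, 2, 3))

// createDataset(RDD) wraps a distributed RDD; deferring the encoding (as the
// ExternalRDD plan does) is safe because RDD elements must already be
// serializable to be shipped to executors.
val fromRdd = spark.createDataset(spark.sparkContext.parallelize(Seq(1, 2, 3)))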

@ueshin ueshin changed the title [SPARK-16189][SQL] Add ExistingRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize. [SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize. Jul 8, 2016
@SparkQA

SparkQA commented Jul 8, 2016

Test build #61950 has finished for PR 13890 at commit e218f5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -74,13 +74,71 @@ object RDDConversions {
}
}

private[sql] object ExternalRDD {

def apply[T: Encoder](rdd: RDD[T])(session: SparkSession): LogicalPlan = {
Contributor

why curry here?

Member Author

Because I wanted to make the signature similar to the case class constructor.
Should I uncurry?

Contributor

Actually I'm not sure why we curry the constructor either. Since RDDScan does it, it's fine for us to follow it. But for this apply method, I don't see the value of doing it. cc @yhuai

Contributor

@marmbrus marmbrus Jul 8, 2016

There is probably no reason to use multiple parameter lists here. We sometimes use them for case classes so that arguments that should not affect equality are not included in the generated equals method.
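
A small sketch of that pattern (a hypothetical class, not code from this PR): only the first parameter list of a case class takes part in the generated equals and hashCode.

// The session in the second parameter list is ignored for equality.
case class ExampleRelation(output: Seq[String])(session: String) {
  // Second-list parameters are not vals, but they can be used in the body.
  def currentSession: String = session
}

val a = ExampleRelation(Seq("id"))("session-1")
val b = ExampleRelation(Seq("id"))("session-2")
assert(a == b)  // true: only `output` is compared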

@cloud-fan
Contributor

LGTM, cc @liancheng to take another look

@SparkQA

SparkQA commented Jul 9, 2016

Test build #62008 has finished for PR 13890 at commit 61da040.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@cloud-fan
Contributor

will merge it once tests pass, thanks for working on it!

@SparkQA

SparkQA commented Jul 12, 2016

Test build #62150 has finished for PR 13890 at commit 61da040.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

merging to master!

@asfgit asfgit closed this in 5b28e02 Jul 12, 2016
@ueshin
Member Author

ueshin commented Jul 12, 2016

@cloud-fan Thank you for merging this!
