-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize. #13890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…liminate serialize/deserialize.
Test build #61169 has finished for PR 13890 at commit
|
- because the Encoder for Java Beans needs not only getters but also setters.
Test build #61176 has finished for PR 13890 at commit
|
@marmbrus Could you review or assign someone to review this pr please? |
/cc @cloud-fan |
} | ||
|
||
/** Physical plan node for scanning data from an RDD. */ | ||
private[sql] case class ExistingRDDScanExec[T]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the name it's hard to tell what's the difference between this one and RDDScanExec
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming RDDScanExec
to LogicalRDDScanExec
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a good idea here, cc @yhuai
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LogicalRDDScanExec
sounds weird because Logical
means that it is a logical node but Exec
means that it is physical node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about ExternalRDDScan?
it's a pretty good optimization! should we also apply it to |
Hmm, I think we can't apply it to |
Test build #61950 has finished for PR 13890 at commit
|
@@ -74,13 +74,71 @@ object RDDConversions { | |||
} | |||
} | |||
|
|||
private[sql] object ExternalRDD { | |||
|
|||
def apply[T: Encoder](rdd: RDD[T])(session: SparkSession): LogicalPlan = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why curry here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I wanted to make the signature similar to the case class constructor.
Should I uncurry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm not sure why we curry the constructor either. Since RDDScan
does it, it's ok we follow it. But for this apply method
, I don't see the value of doing it. cc @yhuai
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is probably no reason to use multiple parameter lists here. We sometimes use it for case classes so that arguments that should not effect equality are not included in the generated equals
method.
LGTM, cc @liancheng to take another look |
Test build #62008 has finished for PR 13890 at commit
|
retest this please |
will merge it once tests pass, thanks for working on it! |
Test build #62150 has finished for PR 13890 at commit
|
merging to master! |
@cloud-fan Thank you for merging this! |
What changes were proposed in this pull request?
Currently the input
RDD
ofDataset
is always serialized toRDD[InternalRow]
prior to being asDataset
, but there is a case that we usemap
ormapPartitions
just after converted toDataset
.In this case, serialize and then deserialize happens but it would not be needed.
This pr adds
ExistingRDD
logical plan for input withRDD
to have a chance to eliminate serialize/deserialize.How was this patch tested?
Existing tests.