[SPARK-2314][SQL] Override collect and take in python library, and count in java library, with optimized versions. #1592
Conversation
Can one of the admins verify this patch?
Thanks for working on this! We'll need to coordinate merging with #1346 and related PRs. (cc @yhuai) @JoshRosen can you look at the other pyspark changes?
Sure, I’m fine with reworking based on other changes (it seems that some merge conflicts have already cropped up in master since I submitted my PR last week). I think my change set is a little simpler than the one you linked to, so would it make sense for me to wait until that one goes in? I also thought I’d add a couple of notes on what I had in mind with this patch: …
Yeah, I'm hoping to merge #1346 as soon as it passes Jenkins, so I'd wait for that. Can you add these notes to the PR description so that they get included in the commit message?
Sure, added the notes.
Hi, I've updated the patch to work with the new code in master.
cc @davies
Sure, no problem.
@staple Could you rebase this PR onto PR-1598? It's getting very close to merging.
This seems to have captured a bunch of unrelated changes during the rebase.
Sorry, I'm away from home and had limited time / access to try to do the merge last night, which I didn't finish, and as you mentioned I messed up the included commits. I'll post an explicit comment here when the merge is ready.
Hi folks, I’ve merged with the most recent code (pushed to my branch), but with the most recent merge I am getting NPEs in Kryo for schemas containing array data type fields in the sql.py tests. I’m away from home, with no real dev system and spotty internet access until Thursday, so unfortunately I think it’s impractical for me to diagnose the problem until then. Sorry for the delay.
Hey @staple, if you have time to update this, it would be great to include it.
Can one of the admins verify this patch?
Sorry for the delay. I wrote a simple test case in pyspark to trigger the Kryo NPEs I've started seeing; it doesn't require my patch. I tested against master at ba5bcad:

from pyspark.sql import Row, SQLContext
…

The Kryo NPE is coming from the TaskResultGetter:

14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result

@davies, do you have any insight as to the cause?
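For reference, a minimal sketch of the kind of repro described above, assuming the Spark 1.1-era pyspark.sql API; everything beyond the quoted import line is an assumption based on the earlier report of Kryo NPEs for schemas with array-typed fields:

```python
# Hypothetical reconstruction of the truncated test case above; only the
# import line survives in the thread, so the rest is illustrative.
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext

sc = SparkContext("local", "kryo-npe-repro")
sqlCtx = SQLContext(sc)

# A row whose inferred schema contains an array-typed field, the case
# reported to trigger NPEs in Kryo during task result deserialization.
rdd = sc.parallelize([Row(values=[1, 2, 3])])
srdd = sqlCtx.inferSchema(rdd)

# Before the fix in #2323, collecting raised the TaskResultGetter Kryo NPE.
print(srdd.collect())
```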
@marmbrus It seems that JListWrapper() cannot be serialized by Kryo correctly, but I don't know how to reproduce it in Scala. The bug can be reproduced in one line of Python:
…
Hmm, that Kryo error is unfortunate. We'll probably need to add a special serializer in SparkSqlSerializer to work around this.
#2323 seems to fix the Kryo error for me. Let me know if you have further issues after updating this to master.
QA tests have started for PR 1592. This patch merges cleanly.
Great! I'll go ahead and merge once #2323 is in.
I just merged #2323 to master.
Thanks! I'm not getting any more NPEs now. I went ahead and merged.
QA tests have started for PR 1592 at commit …
QA tests have finished for PR 1592 at commit …
Hi, the failing test was CheckpointSuite / 'recovery with file input stream'. This test passed when I ran the tests locally, and it sometimes fails spuriously according to this ticket: 'flaky test case in streaming.CheckpointSuite'.
Also, I'm assuming I don't have permission to ask Jenkins to run a test myself, right?
@staple The Jenkins pull request builder is in an odd state of flux right now. I've manually re-triggered your build (I should have self-service "retest this please" working more consistently by sometime next week).
QA tests have started for PR 1592 at commit …
QA tests have started for PR 1592 at commit …
@JoshRosen Great, thanks for your help!
QA tests have finished for PR 1592 at commit …
QA tests have finished for PR 1592 at commit …
SQL LGTM. @JoshRosen is this ready to go?
@@ -36,6 +37,65 @@
from py4j.java_collections import ListConverter

__all__ = ["JavaStackTrace", "SparkContext"]

This file sets __all__ further down on line 100; we should only set it once. Actually, do you mind just moving the extract_concise_traceback stuff to its own file? There's an open JIRA ticket for this (SPARK-1087), and if you're going to touch the traceback code anyway, now seems like a good time to do this refactoring. You could name the file something like traceback_utils.py.
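A rough sketch of the suggested split, under stated assumptions: the helper's real body isn't shown in this thread, so everything below the function name is illustrative.

```python
# traceback_utils.py -- sketch of the refactoring suggested above
# (SPARK-1087): move the traceback helpers out of context.py so that
# context.py sets __all__ exactly once. Only the function name comes from
# the review comment; the body is illustrative, not PySpark's actual code.
import traceback


def extract_concise_traceback():
    """Return a short 'function at file:line' summary of the first stack
    frame that lives outside pyspark itself."""
    for filename, line, func, _ in reversed(traceback.extract_stack()):
        if "pyspark" not in filename:
            return "%s at %s:%d" % (func, filename, line)
    return None
```

context.py would then import the helper from this module, keeping its single __all__ assignment.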
Thanks for taking a look, guys. Hmm, it looks like the duplicate __all__ …
Now that #2385 has been merged, this looks like it will be ready to merge as soon as you rebase it on top of master.
Ok, just merged with master.
Thanks! I've merged this to master.
Great! Thanks.
SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions that make use of the query optimizer. The Java and Python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well. This patch overrides the relevant calls in the Python and Java interfaces with optimized versions.

It also adds a new Row serialization pathway between Python and Java, based on JList[Array[Byte]] rather than the existing RDD[Array[Byte]]. I wasn’t overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to Python using the existing serialization code. To me it made sense to ship the Array[Row] over to Python directly, instead of converting it back to an RDD[Row] just to send the Rows to Python via the existing serialization code.
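As a rough illustration of the pathway described, here is a minimal Python-side sketch; the JVM helper name collectToPython() and the serializer choice are assumptions for illustration, not confirmed by this thread:

```python
# Sketch of the Python wrapper's optimized collect(). The JVM side is
# assumed to run the Catalyst-optimized executeCollect(), pickle each Row,
# and return the batch as JList[Array[Byte]], skipping the round trip
# through an RDD[Array[Byte]].
from pyspark.serializers import PickleSerializer


class SchemaRDDWrapper(object):
    """Thin stand-in for PySpark's SchemaRDD wrapper class."""

    def __init__(self, jschema_rdd):
        self._jschema_rdd = jschema_rdd  # py4j handle to the JVM SchemaRDD
        self._ser = PickleSerializer()

    def collect(self):
        # collectToPython() is a hypothetical JVM method returning a java
        # List of pickled rows (py4j surfaces byte[] as bytearray).
        pickled_rows = self._jschema_rdd.collectToPython()
        return [self._ser.loads(bytes(b)) for b in pickled_rows]
```

take() and the Java-side count() would follow the same pattern: delegate to the JVM wrapper so the optimized query-plan implementations run instead of the generic RDD ones.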