[SPARK-5722][SQL] fix for infer long type in python similar to Java long #4521
Conversation
Can one of the admins verify this patch?
Do you mind adding [SQL] to the pull request title, after [SPARK-5722]?
Jenkins, test this please.
Test build #27275 has started for PR 4521 at commit
Test build #27275 has finished for PR 4521 at commit
Test FAILed.
@dondrake This looks good to me, just one comment, thanks! In Python, we infer the schema based only on the first few rows, so it's possible to get IntegerType for a large Python int after this patch. In this case, the user could provide a correct schema to work around it. Could you add a comment for this?
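For anyone hitting this before the patch lands, here is a minimal sketch of the explicit-schema workaround described above, written against the Spark 1.2-era Python API; the column names, values, and app name are made up for illustration.

```python
# Hypothetical sketch: declare LongType explicitly so schema inference on the
# first few rows cannot settle on IntegerType for values that need 64 bits.
from pyspark import SparkContext
from pyspark.sql import SQLContext, StructType, StructField, StringType, LongType

sc = SparkContext(appName="explicit-long-schema")      # illustrative app name
sqlContext = SQLContext(sc)

rdd = sc.parallelize([("a", 1), ("b", 9876543210)])    # second value exceeds 32 bits

schema = StructType([
    StructField("name", StringType(), False),
    StructField("value", LongType(), False),           # force a 64-bit column
])

# applySchema() is the 1.2-era entry point; later releases use createDataFrame().
table = sqlContext.applySchema(rdd, schema)
table.registerTempTable("values")
```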
@rxin I updated the title of the pull request. @davies In regards to inferSchema(), this is a PR for v1.2; I'm going to submit another PR for 1.3 that will use the new DataFrame(). In regards to the comment, where would you like to see that? I understand how inferSchema() works; all rows of my dataset contained data that fell into this edge case.
@@ -605,6 +605,10 @@ def _infer_type(obj):

    dataType = _type_mappings.get(type(obj))
    if dataType is not None:
        # Conform to Java int/long sizes SPARK-5722
        if dataType == IntegerType:
Add a comment here.
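For context, since the tail of the hunk is not shown above, here is a rough sketch of the kind of size check being discussed; the helper name and bounds handling are assumptions, not the patch itself.

```python
# Sketch only (not the actual patch): promote a Python int to LongType when it
# falls outside Java's signed 32-bit range, per SPARK-5722.
from pyspark.sql import IntegerType, LongType

_JAVA_INT_MIN = -(2 ** 31)
_JAVA_INT_MAX = 2 ** 31 - 1

def _infer_int_type(obj):        # hypothetical helper name
    """Return IntegerType for values that fit in a Java int, else LongType."""
    if _JAVA_INT_MIN <= obj <= _JAVA_INT_MAX:
        return IntegerType
    return LongType
```

Returning the class rather than an instance matches the `dataType == IntegerType` comparison in the diff above.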
Test build #598 has started for PR 4521 at commit
@dondrake After adding a comment, I think it's ready to go.
Test build #598 has finished for PR 4521 at commit
OK, final changes applied. Also, I created another PR for branch-1.3: #4538
Jenkins, retest this please.
Is there a reason not to just always use long (unless the user says int explicitly)? The memory savings don't seem worth the chance of throwing confusing errors or truncating, especially since Parquet and the in-memory cache use zigzag encoding.
Test build #27303 has started for PR 4521 at commit
Test build #27303 has finished for PR 4521 at commit
Test FAILed.
My bad, a test was accidentally pasted as part of the branch-1.3 commit. Now fixed, committed, and pushed. Please test again.
BTW I think Michael has a good point - always use longs. It is cheap to do that, and a lot more robust against inference errors. This doesn't matter much in Python since there isn't compile-time typing anyway.
@rxin, after thinking about it some more, I agree with @marmbrus. Let me make another change so _type_mappings looks like this:
I'll remove my if statement and add some comments around that as well. This also reduces the risk of the inference seeing a small sample and choosing IntegerType.
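Roughly, the direction the thread settles on is a mapping like the sketch below; these entries are illustrative, not the exact dict from pyspark/sql.py.

```python
# Illustrative sketch: map Python int straight to LongType so inference never
# depends on which values happen to appear in the sampled rows.
from pyspark.sql import BooleanType, LongType, DoubleType, StringType

_type_mappings = {
    bool: BooleanType,
    int: LongType,     # previously IntegerType; Python ints are 64-bit on 64-bit machines
    float: DoubleType,
    str: StringType,
}
```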
Thanks for doing it. @yhuai we should do the same thing for JSON type inference too. Unless we see a decimal, integer types should by default just use LongType.
Sounds good. #4544 is for that.
Test FAILed.
This failure comes from my test, but it shouldn't fail when saving a Long; the exception says it can't convert an Integer to a Long.
It appears to be casting (not converting) an Integer to a Long, which you can't do. But why does it think this is an Integer in the first place when it's defined as a LongType in Python and in the Spark Scala code? I can confirm that I did see this in Spark 1.2.0, which motivated me to start this JIRA and why I added this additional test case.
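For context, a minimal reproduction sketch of the round trip being described, with hypothetical names and paths: the schema declares LongType, the values are small Python ints, and the failure appears when the JVM side casts the resulting Integer to a Long during the Parquet write.

```python
# Hypothetical reproduction sketch of the failure mode described above.
from pyspark import SparkContext
from pyspark.sql import SQLContext, StructType, StructField, LongType

sc = SparkContext(appName="long-parquet-repro")    # illustrative app name
sqlContext = SQLContext(sc)

schema = StructType([StructField("id", LongType(), False)])
rows = sc.parallelize([(1,), (2,)])                # small values, fit in 32 bits

schema_rdd = sqlContext.applySchema(rows, schema)
schema_rdd.saveAsParquetFile("/tmp/long_repro.parquet")   # hypothetical path
```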
@davies can you take a look please?
@dondrake The
I'm struggling with how to handle this. I would prefer that saveAsParquet() would handle converting the value to a Long. Thoughts?
The parquet write path needs to assume the data matches the schema.
OK, I'll keep looking, but can you let me know where that is in the code? -Don
So, in ./sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala, in EvaluatePython.toJava() I added the following:
and in EvaluatePython.fromJava() I added the following:
The fromJava() is never executed, but I do see the toJava() print output. Unfortunately, this does not force the Python value to a long.
@dondrake I just noticed it is for branch-1.2. Do we have a patch for the master?
@yhuai I have commits for branch-1.2 and branch-1.3 for this fix. Is that not correct? I have finished changes for branch-1.2 to resolve the failing test. Please advise where my changes for master/1.3 should reside.
@dondrake Can you also create a PR against our master? In our master,
So, I should not be updating branch-1.3? Should I just create a branch off of master with my changes? Are my changes for 1.2.x okay to go on branch-1.2?
In general you should open all PRs against master, and we will backport them manually when merging. Please suggest this in the comments (or after the fact add the "backport needed" tag on JIRA). If it's a specialized fix that, due to conflicts, can only be made against a specific release branch, then it's okay to make it against that branch. For Spark SQL at this point master and branch-1.3 are functionally the same. Regarding branch-1.2, I would not want to backport something that makes a change this large to a maintenance branch, since it could break existing code.
OK, this PR, which is against branch-1.2, is now updated and I've verified that the tests are now passing. I created another branch off of master (named drake_python_long) that has the changes needed for v1.3. I'll create another PR for that one. Please test.
ok to test
Test build #27648 has started for PR 4521 at commit
Test build #27648 has finished for PR 4521 at commit
Test PASSed.
The
I think it's better to convert the Integer into Long in SQLContext.applySchemaToPythonRDD(); you could have a special case in needConversion() and EvaluatePython.fromJava(). We already did something similar for ByteType/Short; it's easy to add a case for (Int, LongType).
… branch) This PR change to use LongType for int in Python, when inferSchema(), because IntegerType in SQL is not enough for int in Python (which is 64-bit on 64-bit machines). Closes #4521 cc dondrake marmbrus Author: Davies Liu <davies@databricks.com> Closes #4681 from davies/long2 and squashes the following commits: 05ef1c8 [Davies Liu] infer LongType for int in Python
Agreed, closed. Thanks everyone.
Please review.