
[SPARK-5722][SQL] fix for infer long type in python similar to Java long #4521

Closed
wants to merge 9 commits

Conversation

dondrake

Please review.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Feb 11, 2015

Do you mind adding [SQL] to the pull request title, after [SPARK-5722]?

@rxin
Contributor

rxin commented Feb 11, 2015

Jenkins, test this please.

@SparkQA

SparkQA commented Feb 11, 2015

Test build #27275 has started for PR 4521 at commit f53b94c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 11, 2015

Test build #27275 has finished for PR 4521 at commit f53b94c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27275/
Test FAILed.

@davies
Contributor

davies commented Feb 11, 2015

@dondrake This looks good to me, just one comment, thanks!

In Python we infer the schema based only on the first few rows, so it's still possible to get IntegerType for a large Python int after this patch. In that case, the user can provide an explicit schema to work around it.

Could you add a comment for this?
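For illustration, a minimal sketch of the workaround davies describes, written against the Spark 1.2-era pyspark.sql API (applySchema with an explicit StructType); the column names and sample data here are invented:

    from pyspark.sql import SQLContext, StructType, StructField, StringType, LongType

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

    # The leading rows fit in 32 bits, so sampling-based inference could pick
    # IntegerType; an explicit schema sidesteps inference entirely.
    rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 1 << 33)])
    schema = StructType([
        StructField("key", StringType(), False),
        StructField("value", LongType(), False),  # force 64-bit up front
    ])
    srdd = sqlContext.applySchema(rdd, schema)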

@dondrake dondrake changed the title [SPARK-5722] fix for infer long type in python similar to Java long [SPARK-5722][SQL] fix for infer long type in python similar to Java long Feb 11, 2015
@dondrake
Author

@rxin I updated the title of the pull request.

@davies Regarding inferSchema(): this is a PR for v1.2; I'm going to submit another PR for 1.3 that will use the new DataFrame().

Regarding the comment, where would you like to see it? I understand how inferSchema() works; all rows of my dataset had data that fell into this edge case.

@@ -605,6 +605,10 @@ def _infer_type(obj):

    dataType = _type_mappings.get(type(obj))
    if dataType is not None:
        # Conform to Java int/long sizes SPARK-5722
        if dataType == IntegerType:

Add a comment here.
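The hunk body is truncated at this review anchor; purely as a hedged sketch, a range check of this shape would promote values that overflow a signed 32-bit Java int (the actual lines in the PR may differ):

    dataType = _type_mappings.get(type(obj))
    if dataType is not None:
        # Conform to Java int/long sizes SPARK-5722
        if dataType == IntegerType:
            # Hypothetical body: fall back to LongType outside Java int range
            if obj < -(1 << 31) or obj > (1 << 31) - 1:
                dataType = LongType
        return dataType()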

@SparkQA

SparkQA commented Feb 11, 2015

Test build #598 has started for PR 4521 at commit f53b94c.

  • This patch merges cleanly.

@davies
Contributor

davies commented Feb 11, 2015

@dondrake After adding a comment, I think it's ready to go.

@SparkQA

SparkQA commented Feb 11, 2015

Test build #598 has finished for PR 4521 at commit f53b94c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dondrake
Author

OK, final changes applied.

Also, I created another PR for branch-1.3 #4538

@rxin
Contributor

rxin commented Feb 11, 2015

Jenkins, retest this please.

@marmbrus
Contributor

Is there a reason not to just always use long (unless the user says int explicitly)? The memory savings don't seem worth the risk of throwing confusing errors or truncating, especially since Parquet and the in-memory cache use zigzag encoding.

@SparkQA

SparkQA commented Feb 11, 2015

Test build #27303 has started for PR 4521 at commit a694093.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 11, 2015

Test build #27303 has finished for PR 4521 at commit a694093.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27303/
Test FAILed.

@dondrake
Author

My bad, a test was accidentally pasted in as part of the branch-1.3 commit. Now fixed, committed, and pushed. Please test again.

@rxin
Contributor

rxin commented Feb 11, 2015

BTW I think Michael has a good point - always use longs. It is cheap to do that, and a lot more robust against inference errors. This doesn't matter much in Python since there isn't compile-time typing anyway.

@dondrake
Author

@rxin, after thinking about it some more, I agree with @marmbrus. Let me make another change so _type_mappings looks like this:

_type_mappings = {
    type(None): NullType,
    bool: BooleanType,
    int: LongType,    # map Python int to LongType to match Java long sizes (SPARK-5722)
    long: LongType,
    float: DoubleType,
    str: StringType,
    unicode: StringType,
    bytearray: BinaryType,
    decimal.Decimal: DecimalType,
    datetime.date: DateType,
    datetime.datetime: TimestampType,
    datetime.time: TimestampType,
}

I'll remove my if statement and add some comments around that as well.

This also reduces the risk of the inference seeing a small sample and choosing IntegerType.
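As a quick illustration of the underlying size mismatch (Python 2 on a 64-bit build; the outputs in the comments assume such a build):

    import sys

    # On a 64-bit CPython 2 build, int is backed by a 64-bit C long.
    print(sys.maxint)            # 9223372036854775807
    big = 1 << 33                # still a plain Python int...
    print(isinstance(big, int))  # True
    print(big > 2 ** 31 - 1)     # True: overflows java.lang.Integer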

@rxin
Contributor

rxin commented Feb 11, 2015

Thanks for doing it.

@yhuai we should do the same thing for JSON type inference too. Unless we see a decimal, integer types should by default just use LongType.

@yhuai
Contributor

yhuai commented Feb 12, 2015

Sounds good. #4544 is for that.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27450/
Test FAILed.

@dondrake
Author

This failure comes from my test: it shouldn't fail when saving a long, yet it throws an exception that it can't cast an Integer to a Long.

File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql.py", line 1475, in pyspark.sql.SQLContext.parquetFile
Failed example:
    srdd.saveAsParquetFile(parquetFile)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
        compileflags, 1) in test.globs
      File "<doctest pyspark.sql.SQLContext.parquetFile[4]>", line 1, in <module>
        srdd.saveAsParquetFile(parquetFile)
      File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql.py", line 1906, in saveAsParquetFile
        self._jschema_rdd.saveAsParquetFile(path)
      File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
        self.target_id, self.name)
      File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
        format(target_id, '.', name), value)
    Py4JJavaError: An error occurred while calling o715.saveAsParquetFile.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 55.0 failed 1 times, most recent failure: Lost task 3.0 in stage 55.0 (TID 147, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
        at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
        at org.apache.spark.sql.catalyst.expressions.GenericRow.getLong(Row.scala:153)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:350)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
        at 

It appears to be casting (not converting) an Integer to a Long, which you can't do. But why does it think this is an Integer in the first place, when it's defined as a LongType in the Python and the Spark Scala code?

I can confirm that I saw this in Spark 1.2.0, which is what motivated me to open this JIRA and add this additional test case.

@marmbrus
Contributor

@davies can you take a look please?

@davies
Contributor

davies commented Feb 13, 2015

@dondrake The int in Python is inferred as LongType but not converted into long, so it is pickled, then unpickled as an Integer in the JVM by Pyrolite. You need to either convert the int to long in Python or convert the Integer to a Long in the JVM.
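A minimal sketch of the Python-side option, assuming Spark 1.2's pyspark.sql types; the helper name and call site are hypothetical, not actual pyspark internals:

    from pyspark.sql import LongType

    # Hypothetical helper: force values in LongType fields to Python long
    # before pickling, so Pyrolite unpickles them as java.lang.Long.
    def _coerce_longs(row, schema):
        return tuple(
            long(value) if isinstance(field.dataType, LongType) and value is not None
            else value
            for value, field in zip(row, schema.fields)
        )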

@dondrake
Author

I'm struggling with how to handle this. I would prefer that saveAsParquetFile() handle converting the value to a long for me. I could update the test to store a long datatype, but that means that if I update a SchemaRDD long value in Python, I have to guarantee it is a long. Not very Pythonic, IMO.

Thoughts?

@marmbrus
Contributor

The Parquet write path needs to assume the data matches the schema; otherwise we'll slow down all writing of data to Parquet. Instead, I suggest we check at the JVM side of the Python/JVM boundary and convert ints to longs.

@dondrake
Author

OK, I'll keep looking, but can you let me know where that is in the code?

-Don

@dondrake
Author

So, in ./sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala

In EvaluatePython.toJava() I added the following:

    case (_, LongType) => {
        println("Converting to LongType " + obj)
        obj.asInstanceOf[Long]
    }

and in EvaluatePython.fromJava() I added the following:

    case (_, LongType) => {
        println("fromJava _ obj=" + obj)
        obj.asInstanceOf[Long]
    }

The fromJava() case is never executed, but I do see the toJava() print output.

Unfortunately, this does not force the Python value to a long.

@yhuai
Contributor

yhuai commented Feb 16, 2015

@dondrake I just noticed this is for branch-1.2. Do we have a patch for master?

@dondrake
Author

@yhuai I have commits for branch-1.2 and branch-1.3 for this fix. Is that not correct?

I have finished the changes for branch-1.2 to resolve the failing test. Please advise where my changes for master/1.3 should reside.

@yhuai
Contributor

yhuai commented Feb 16, 2015

@dondrake Can you also create a PR against our master? In our master, _type_mappings is in https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py and your test can go in https://github.com/apache/spark/blob/master/python/pyspark/sql/tests.py.

@dondrake
Author

So I should not be updating branch-1.3? Should I just create a branch off of master with my changes?

Are my changes for 1.2.x okay to go on branch-1.2?

@marmbrus
Contributor

In general you should open all PRs against master, and we will backport them manually when merging. Please suggest this in the comments (or, after the fact, add the "backport needed" tag on JIRA).

If it's a specialized fix that, due to conflicts, can only be made against a specific release branch, then it's okay to make it against that branch. For Spark SQL, at this point master and branch-1.3 are functionally the same. Regarding branch-1.2, I would not want to backport a change this large to a maintenance branch, since it could break existing code.

@dondrake
Author

OK, this PR, which is against branch-1.2, is now updated, and I've verified that the tests are now passing.

I created another branch off of master (named drake_python_long) that has the changes needed for v1.3. I'll create another PR for that one.

Please test.

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27648 has started for PR 4521 at commit 69ce6d0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27648 has finished for PR 4521 at commit 69ce6d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27648/
Test PASSed.

@davies
Contributor

davies commented Feb 17, 2015

The long in the JVM will be converted into int in Python by Pyrolite, so this change sounds reasonable.

long in Python is actually a BigInteger; it's much slower than int, and also slow to pickle/unpickle.

I think it's better to convert the Integer into a Long in SQLContext.applySchemaToPythonRDD(); you could add a special case in needConversion() and EvaluatePython.fromJava(). We already do something similar for ByteType/ShortType, so it's easy to add a case for (Int, LongType).
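davies' performance point is easy to spot-check (Python 2 only, since the 1L literal forces a long; absolute timings are machine-dependent):

    import timeit

    # Arbitrary-precision long arithmetic is noticeably slower than
    # machine-word int arithmetic.
    print(timeit.timeit("x + 1", setup="x = 1", number=10 ** 6))   # int
    print(timeit.timeit("x + 1", setup="x = 1L", number=10 ** 6))  # long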

@davies
Contributor

davies commented Feb 18, 2015

@dondrake Because we need to cut 1.3-RC1 now, I created another PR for master and 1.3; see #4666

asfgit pushed a commit that referenced this pull request Feb 24, 2015
… branch)

This PR changes inferSchema() to use LongType for int in Python, because IntegerType in SQL is not big enough for int in Python (which is 64-bit on 64-bit machines).

Closes #4521

cc dondrake marmbrus

Author: Davies Liu <davies@databricks.com>

Closes #4681 from davies/long2 and squashes the following commits:

05ef1c8 [Davies Liu] infer LongType for int in Python
@srowen
Member

srowen commented Feb 24, 2015

Can this be closed, @dondrake? Looks like this was resolved in #4666 and #4681.

@dondrake
Author

Agreed, closed. Thanks everyone.

@dondrake dondrake closed this Feb 24, 2015
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Feb 24, 2015