[SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide #26045
Conversation
Test build #111850 has finished for PR 26045 at commit
Nice! I will test PySpark and SparkR and merge this one!
@@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.

### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Oh, actually @BryanCutler, would you mind adding this (or link back) at https://github.com/apache/spark/blob/master/docs/sparkr.md#apache-arrow-in-sparkr ? No worry about testing it out, I will do it tonight.
@HyukjinKwon so you don't need this note for R, since Arrow was not used in 2.4.x?
```
ARROW_PRE_0_15_IPC_FORMAT=1
```

This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.
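For users on an already released Spark 2.3.x/2.4.x who can't easily edit `conf/spark-env.sh`, roughly the same effect should be achievable from the application itself. A minimal sketch (not from this PR, assuming client deploy mode): the variable is set for the driver via `os.environ` and forwarded to the executors through Spark's generic `spark.executorEnv.*` mechanism.

```python
import os
from pyspark.sql import SparkSession

# Set before any Arrow conversion happens in the driver process.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    SparkSession.builder
    .appName("arrow-legacy-ipc")
    # Spark 2.x name of the Arrow optimization toggle.
    .config("spark.sql.execution.arrow.enabled", "true")
    # Forward the variable to the executors' Python workers.
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)
```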
Do we need to mention this in SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.
From the wording, does it mean that if using Spark 3.0, which has the newer Arrow Java, you do not need to set it?
Hm, @BryanCutler do you target to upgrade and also increase the minimum version of PyArrow at SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?
If so, we don't have to deal with https://github.com/apache/spark/pull/26045/files#r332285077 since Arrow with R is new in Spark 3.0.
If that's the case, increasing the minimum version of Arrow R to 0.15.0 is fine with me too.
We aren't updating Arrow in 2.x, right? This would just be for users who go off-road and manually update it in their usage of PySpark?
Yeah, we don't update it in 2.x I guess, and it seems this note targets 2.x.
I was wondering if we're going to upgrade the Arrow Java side in Spark 3.0, and whether it provides compatibility with lower PyArrow and Arrow R versions - if it doesn't, we should increase the minimum versions of PyArrow and Arrow R.
> Do we need to mention this in SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.
It's not a requirement for migrating, because it's only needed when upgrading pyarrow to 0.15.0 with Spark 2.x. Although, someone on Spark 2.3.x might have pyarrow 0.8.0, and then migrating to Spark 2.4.x will be forced to upgrade pyarrow to a minimum version of 0.12.1, and might end up with 0.15.0 and need the env var. It's a bit of a stretch, but if you think it could help to add it to the guide, I will.
> Hm, @BryanCutler do you target to upgrade and also increase the minimum version of PyArrow at SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?
So once we upgrade Arrow Java to 0.15.0, it is not necessary to set the env var and will work with older versions of pyarrow also. Because of this, I don't think it's necessary to increase the minimum version right now. I do think we will have Arrow 1.0 before Spark 3.0, so it would make sense to set that as the minimum version.
> We aren't updating Arrow in 2.x, right?
I wondered about this. I know we don't usually do this, but it would remove the need for the env var, I believe. Is it something we should consider?
Thanks for the clarification! Then, I guess we don't need a note for the R one.
Yup, I can add it there too
Spark so that PySpark can maintain compatibility with versions of PyArrow 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format:

```
ARROW_PRE_0_15_IPC_FORMAT=1
```
Can we just set it by default?
This is for already released Spark versions, where the user has upgraded pyarrow to 0.15.0 in their cluster
I think I'd just clarify this in the new docs, that it's only needed if you manually update pyarrow this way.
Yes I agree, I'll clarify this
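One way to express that clarification in user code (a sketch, not something from the PR): guard the variable on the versions actually installed, so it only kicks in for a manually upgraded pyarrow on Spark 2.x. Note it still has to reach the executors too, e.g. via `conf/spark-env.sh`.

```python
import os
from distutils.version import LooseVersion

import pyarrow
import pyspark

# Only needed when a manually upgraded pyarrow (>= 0.15.0, new IPC format)
# talks to a Spark 2.x whose bundled Arrow Java still uses the old format.
if (LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0")
        and LooseVersion(pyspark.__version__) < LooseVersion("3.0.0")):
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```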
### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x

Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in
How about adding a link http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140
to the release blog of Apache Arrow?
Sure, that would be good thanks!
Updated to add some clarification, links to the JIRA showing the error message if the env var is not set, and a link to the Arrow blog.
Merged to master.
Thanks @HyukjinKwon and others for reviewing!
I'd set the upper bound for the pyarrow version for now, since we need to set an environment variable to work with pyarrow 0.15. apache/spark#26045:
> Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
### What changes were proposed in this pull request?
Add documentation to the SQL programming guide on using PyArrow >= 0.15.0 with current versions of Spark.
### Why are the changes needed?
Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran pandas_udfs tests using PyArrow 0.15.0 with the environment variable set.
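For reference, a minimal sketch of the kind of `pandas_udf` round trip such a test exercises (an assumed shape, not the PR's actual test suite); the variable is set before anything touches pyarrow, and `local[2]` keeps all processes on one machine so the setting covers the Python workers too:

```python
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # must be set before pyarrow is used

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series batch carried over Arrow IPC
    return v + 1

df = spark.range(8).selectExpr("CAST(id AS double) AS x")
df.select(plus_one("x")).show()  # round-trips batches through the Arrow IPC path
```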