[SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide #26045
@@ -219,3 +219,20 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetime
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set
in order to remain compatible with previous versions of Arrow <= 0.14.1. This is only necessary
for PySpark users on Spark 2.3.x and 2.4.x who have manually upgraded PyArrow to 0.15.0. The
following can be added to `conf/spark-env.sh` to use the legacy Arrow IPC format:
```
ARROW_PRE_0_15_IPC_FORMAT=1
```

Can we just set it by default?

This is for already released Spark versions, where the user has upgraded pyarrow to 0.15.0 in their cluster.

I think I'd just clarify this in the new docs, that it's only needed if you manually update pyarrow this way.

Yes I agree, I'll clarify this.
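For readers who cannot edit `conf/spark-env.sh`, the same variable can also be passed per application using Spark's standard `spark.executorEnv.[EnvironmentVariableName]` configuration. This is an illustrative sketch, not part of the PR; the application script name is a placeholder:

```shell
# Set the flag on the driver process.
export ARROW_PRE_0_15_IPC_FORMAT=1

# Propagate the same flag to executor processes via spark.executorEnv.*,
# so both sides of the Arrow IPC exchange use the legacy format.
spark-submit \
  --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 \
  your_app.py   # hypothetical PySpark application
```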
This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that
is in Spark 2.3.x and 2.4.x. Not setting this environment variable will lead to a similar error as
described in [SPARK-29367](https://issues.apache.org/jira/browse/SPARK-29367) when running
`pandas_udf`s or `toPandas()` with Arrow enabled. More information about the Arrow IPC change can
be read on the Arrow 0.15.0 release [blog](http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140).
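Since the workaround only applies when PyArrow >= 0.15.0 is paired with the older Arrow Java, a driver-side guard can set the variable conditionally. The sketch below is not from the PR; `needs_legacy_ipc` is a hypothetical helper, and in practice the version string would come from `pyarrow.__version__`:

```python
import os

def needs_legacy_ipc(pyarrow_version: str) -> bool:
    """Return True when this PyArrow version uses the new (0.15.0+) IPC format."""
    # Compare numerically, e.g. "0.15.1" -> (0, 15, 1); a plain string
    # comparison would wrongly rank "0.9.0" above "0.15.0".
    parts = tuple(int(p) for p in pyarrow_version.split(".")[:3])
    return parts >= (0, 15, 0)

# Must run before any Arrow-enabled exchange (pandas_udf, toPandas).
if needs_legacy_ipc("0.15.1"):  # in practice: pyarrow.__version__
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```

Note this only covers the driver; executors still need the variable set through `conf/spark-env.sh` or the cluster manager.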
Oh, actually @BryanCutler, would you mind adding this (or link back) at https://github.com/apache/spark/blob/master/docs/sparkr.md#apache-arrow-in-sparkr ? No worry about testing it out, I will do it tonight.
@HyukjinKwon so you don't need this note for R, Arrow was not used in 2.4.x?