[SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide #26045
Conversation
Test build #111850 has finished for PR 26045 at commit
Nice! I will test PySpark and SparkR and merge this one!
@@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.

### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Oh, actually @BryanCutler, would you mind adding this (or link back) at https://github.com/apache/spark/blob/master/docs/sparkr.md#apache-arrow-in-sparkr ? No worry about testing it out, I will do it tonight.
@HyukjinKwon so you don't need this note for R, since Arrow was not used in 2.4.x?
```
ARROW_PRE_0_15_IPC_FORMAT=1
```

This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.
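For users on an already released Spark 2.3.x/2.4.x who can't easily edit `conf/spark-env.sh`, roughly the same effect should be achievable from the application itself. A minimal sketch (not from this PR, assuming client deploy mode): the variable is set for the driver via `os.environ` and forwarded to the executors through Spark's generic `spark.executorEnv.*` mechanism.

```python
import os
from pyspark.sql import SparkSession

# Set before any Arrow conversion happens in the driver process.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    SparkSession.builder
    .appName("arrow-legacy-ipc")
    # Spark 2.x name of the Arrow optimization toggle.
    .config("spark.sql.execution.arrow.enabled", "true")
    # Forward the variable to the executors' Python workers.
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)
```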
Do we need to mention this in SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.
From the wording, does it mean that if using Spark 3.0, which has the newer Arrow Java, you do not need to set it?
Hm, @BryanCutler do you target to upgrade and also increase the minimum version of PyArrow at SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?
If so, we don't have to deal with https://github.com/apache/spark/pull/26045/files#r332285077 since Arrow with R is new in Spark 3.0.
If that's the case, increasing the minimum version of Arrow R to 0.15.0 is fine with me too.
We aren't updating Arrow in 2.x, right? This would just be for users who go off-road and manually update it in their usage of PySpark?
Yeah, we don't update it in 2.x I guess, and it seems this note targets 2.x.
I was wondering if we're going to upgrade the Arrow Java side in Spark 3.0, and whether it provides compatibility with lower PyArrow and Arrow R versions - if it doesn't, we should increase the minimum versions of PyArrow and Arrow R.
> Do we need to mention this in SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.
It's not a requirement for migrating, because it's only needed when upgrading pyarrow to 0.15.0 with Spark 2.x. Although, someone on Spark 2.3.x might have pyarrow 0.8.0, and then migrating to Spark 2.4.x will be forced to upgrade pyarrow to a minimum version of 0.12.1, and might end up with 0.15.0 and need the env var. It's a bit of a stretch, but if you think it could help to add it to the guide, I will.
> Hm, @BryanCutler do you target to upgrade and also increase the minimum version of PyArrow at SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?
So once we upgrade Arrow Java to 0.15.0, it is not necessary to set the env var and will work with older versions of pyarrow also. Because of this, I don't think it's necessary to increase the minimum version right now. I do think we will have Arrow 1.0 before Spark 3.0, so it would make sense to set that as the minimum version.
> We aren't updating Arrow in 2.x, right?
I wondered about this. I know we don't usually do this, but it would remove the need for the env var, I believe. Is it something we should consider?
Thanks for the clarification! Then, I guess we don't need a note for the R one.
Yup, I can add it there too
Spark so that PySpark can maintain compatibility with versions of PyArrow 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format:

```
ARROW_PRE_0_15_IPC_FORMAT=1
```
Can we just set it by default?
This is for already released Spark versions, where the user has upgraded pyarrow to 0.15.0 in their cluster
I think I'd just clarify this in the new docs, that it's only needed if you manually update pyarrow this way.
Yes I agree, I'll clarify this
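One way to express that clarification in user code (a sketch, not something from the PR): guard the variable on the versions actually installed, so it only kicks in for a manually upgraded pyarrow on Spark 2.x. Note it still has to reach the executors too, e.g. via `conf/spark-env.sh`.

```python
import os
from distutils.version import LooseVersion

import pyarrow
import pyspark

# Only needed when a manually upgraded pyarrow (>= 0.15.0, new IPC format)
# talks to a Spark 2.x whose bundled Arrow Java still uses the old format.
if (LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0")
        and LooseVersion(pyspark.__version__) < LooseVersion("3.0.0")):
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```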
### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x

Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in
How about adding a link http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140
to the release blog of Apache Arrow?
Sure, that would be good thanks!
Updated to add some clarification, links to the JIRA showing the error message if the env var is not set, and a link to the Arrow blog.
Merged to master.
Thanks @HyukjinKwon and others for reviewing!
I'd set the upper bound for the pyarrow version for now, since we need to set an environment variable to work with pyarrow 0.15. apache/spark#26045:
> Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
### What changes were proposed in this pull request?
Add documentation to the SQL programming guide on using PyArrow >= 0.15.0 with current versions of Spark.
### Why are the changes needed?
Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran pandas_udfs tests using PyArrow 0.15.0 with the environment variable set.
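For reference, a minimal sketch of the kind of `pandas_udf` round trip such a test exercises (an assumed shape, not the PR's actual test suite); the variable is set before anything touches pyarrow, and `local[2]` keeps all processes on one machine so the setting covers the Python workers too:

```python
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # must be set before pyarrow is used

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series batch carried over Arrow IPC
    return v + 1

df = spark.range(8).selectExpr("CAST(id AS double) AS x")
df.select(plus_one("x")).show()  # round-trips batches through the Arrow IPC path
```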