
[SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide #26045


Conversation

BryanCutler
Member

What changes were proposed in this pull request?

Add documentation to SQL programming guide to use PyArrow >= 0.15.0 with current versions of Spark.

Why are the changes needed?

Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set.
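For reference, a minimal sketch of the kind of round trip those tests exercise, assuming PyArrow 0.15.0 with Spark 2.4.x in local mode (the UDF and column names here are illustrative, not the actual test code):

```python
import os

# Set before the session starts so the local-mode Python workers inherit it;
# only needed with PyArrow >= 0.15.0 on Spark 2.3.x / 2.4.x.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").getOrCreate()

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series; Arrow IPC carries it between JVM and Python.
    return v + 1

df = spark.range(10).selectExpr("CAST(id AS double) AS v")
df.select(plus_one("v")).show()
```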

@BryanCutler
Member Author

cc @HyukjinKwon @harpaj

@SparkQA

SparkQA commented Oct 7, 2019

Test build #111850 has finished for PR 26045 at commit 735cf05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon left a comment


Nice! I will test PySpark and SparkR and merge this one!

@@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.

### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
Member


Oh, actually @BryanCutler, would you mind adding this (or link back) at https://github.com/apache/spark/blob/master/docs/sparkr.md#apache-arrow-in-sparkr ? No worry about testing it out, I will do it tonight.

Member Author


@HyukjinKwon so we don't need this note for R, since Arrow was not used in 2.4.x?

```
ARROW_PRE_0_15_IPC_FORMAT=1
```

This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.
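On a cluster, the variable has to be visible to the executors' Python workers, not just the driver. A hedged sketch of one way to do that from PySpark itself, assuming Spark 2.4.x (`spark.executorEnv.*` and `spark.yarn.appMasterEnv.*` are standard Spark properties; `conf/spark-env.sh` covers the standalone case):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    # Make the legacy-IPC flag visible to the Python workers on the executors.
    .set("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    # In YARN cluster mode the driver runs in the application master,
    # so set it there as well.
    .set("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```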
Member

@dongjoon-hyun Oct 7, 2019


Do we need to mention this in SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.

Member


From the wording, does it mean that if using Spark 3.0, which comes with a newer Arrow Java, you do not need to set it?

Member

@HyukjinKwon Oct 8, 2019


Hm, @BryanCutler, do you plan to upgrade and also increase the minimum version of PyArrow at SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?

If so, we don't have to deal with https://github.com/apache/spark/pull/26045/files#r332285077 since Arrow with R is new in Spark 3.0.

If that's the case, increasing the minimum version of Arrow R to 0.15.0 is fine with me too.

Member


We aren't updating Arrow in 2.x, right? This would just be for users who go off-road and manually update it in their usage of PySpark?

Member


Yeah, we don't update it in 2.x I guess, and it seems this note targets 2.x.

I was wondering if we're going to upgrade the Arrow on the JVM side in Spark 3.0, and whether it provides compatibility with lower PyArrow and Arrow R versions. If it doesn't, we should increase the minimum versions of PyArrow and Arrow R.

Member Author


Do we need to mention this in SQL migration guide too? This sounds like a requirement for migrating from 2.3 and 2.4.

It's not a requirement for migrating, because it only applies when upgrading pyarrow to 0.15.0 with Spark 2.x. Although, someone on Spark 2.3.x might have pyarrow==0.8.0, and then migrating to Spark 2.4.x will be forced to upgrade pyarrow to a minimum version of 0.12.1, and might end up with 0.15.0 and need the env var. It's a bit of a stretch, but if you think it could help, I will add it to the guide.

Member Author


Hm, @BryanCutler, do you plan to upgrade and also increase the minimum version of PyArrow at SPARK-29376 (we upgrade the JVM one too; therefore, we don't need to set the environment variable in Spark 3.0)?

So once we upgrade Arrow Java to 0.15.0, it will not be necessary to set the env var, and it will also work with older versions of pyarrow. Because of this, I don't think it's necessary to increase the minimum version right now. I do think we will have Arrow 1.0 before Spark 3.0, so it would make sense to set that as the minimum version.

Member Author


We aren't updating Arrow in 2.x, right?

I wondered about this; I know we don't usually do it, but I believe it would remove the need for the env var. Is it something we should consider?

Member


Thanks for the clarification! Then, I guess we don't need a note for the R one.

@BryanCutler
Member Author

BryanCutler commented Oct 7, 2019 via email

Spark so that PySpark can maintain compatibility with PyArrow 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format:

```
ARROW_PRE_0_15_IPC_FORMAT=1
```
Member


Can we just set it by default?

Member Author


This is for already released Spark versions, where the user has upgraded pyarrow to 0.15.0 in their cluster.
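A hedged sketch of guarding that case on the user side, setting the flag only when the installed PyArrow actually needs it (the cutoff mirrors the note in the docs):

```python
import os
from distutils.version import LooseVersion

import pyarrow

# The legacy IPC flag is only needed when PyArrow >= 0.15.0 talks to the
# older Arrow Java bundled with Spark 2.3.x / 2.4.x.
if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0"):
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```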

Member


I think I'd just clarify this in the new docs, that it's only needed if you manually update pyarrow this way.

Member Author


Yes, I agree, I'll clarify this.


### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x

Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in
Member


How about adding a link http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140 to the release blog of Apache Arrow?

Member Author


Sure, that would be good, thanks!

@BryanCutler
Member Author

Updated to add some clarification, a link to the JIRA showing the error message if the env var is not set, and a link to the Arrow blog.

@HyukjinKwon
Member

Merged to master.

@BryanCutler
Member Author

Thanks @HyukjinKwon and others for reviewing!

@BryanCutler BryanCutler deleted the arrow-document-legacy-IPC-fix-SPARK-29367 branch October 11, 2019 01:04
ueshin added a commit to databricks/koalas that referenced this pull request Oct 11, 2019
I'd set an upper bound on the pyarrow version for now, since we need to set an environment variable to work with pyarrow 0.15.

apache/spark#26045:
> Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
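To make that upper bound concrete, a pin along these lines would do it in a `setup.py` (a hypothetical excerpt, not the actual Koalas requirement list):

```python
from setuptools import setup

setup(
    name="example-package",  # hypothetical package name
    install_requires=[
        # Cap pyarrow below 0.15 until the ARROW_PRE_0_15_IPC_FORMAT
        # workaround is no longer required.
        "pyarrow>=0.10,<0.15",
    ],
)
```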
hagerf pushed a commit to hagerf/spark_fork that referenced this pull request Dec 4, 2019
rshkv pushed a commit to palantir/spark that referenced this pull request May 23, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 24, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jul 15, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jul 15, 2020
rising-star92 added a commit to rising-star92/databricks-koalas that referenced this pull request Jan 27, 2023