@@ -1640,7 +1640,7 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
You may run `./bin/spark-sql --help` for a complete list of all available
options.

- # Usage Guide for Pandas with Arrow
+ # PySpark Usage Guide for Pandas with Arrow

## Arrow in Spark

@@ -1651,19 +1651,19 @@ changes to configuration or code to take full advantage and ensure compatibility
give a high-level description of how to use Arrow in Spark and highlight any differences when
working with Arrow-enabled data.

- ## Ensure pyarrow Installed
+ ### Ensure PyArrow Installed

- If you install pyspark using pip, then pyarrow can be brought in as an extra dependency of the sql
- module with the command "pip install pyspark[sql]". Otherwise, you must ensure that pyarrow is
- installed and available on all cluster node Python environments. The current supported version is
- 0.8.0. You can install using pip or conda from the conda-forge channel. See pyarrow
+ If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
+ SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
+ is installed and available on all cluster nodes. The current supported version is 0.8.0.
+ You can install using pip or conda from the conda-forge channel. See PyArrow
[installation](https://arrow.apache.org/docs/python/install.html) for details.
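
The pip command in the new text is the documented install route; as a quick sanity check that PyArrow is importable and meets the 0.8.0 minimum named above, something like the following sketch can be run in each node's Python environment (the version-check helper and messages are illustrative, not part of the guide):

{% highlight python %}
# Minimal sketch: confirm PyArrow is importable and at least the minimum
# version mentioned in this guide (0.8.0).
from distutils.version import LooseVersion

import pyarrow

minimum_arrow_version = "0.8.0"  # minimum stated above

if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_arrow_version):
    raise ImportError("pyarrow >= %s is required, found %s"
                      % (minimum_arrow_version, pyarrow.__version__))
print("pyarrow %s is available" % pyarrow.__version__)
{% endhighlight %}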

- ## How to Enable for Conversion to/from Pandas
+ ## Enabling for Conversion to/from Pandas

Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
- To use Arrow when executing these calls, it first must be enabled by setting the Spark conf
+ To use Arrow when executing these calls, it first must be enabled by setting the Spark configuration
'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.

<div class="codetabs">
@@ -1683,7 +1683,7 @@ pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame to a local Pandas DataFrame
- selpdf = df.select(" * ").toPandas()
+ selpdf = df.select("*").toPandas()

{% endhighlight %}
</div>
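
The hunk above shows only the tail of the Python example; a self-contained sketch of the flow the surrounding text describes, assuming an existing `SparkSession` bound to the name `spark`, looks roughly like this:

{% highlight python %}
import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers (disabled by default)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from the Pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a local Pandas DataFrame using Arrow
selpdf = df.select("*").toPandas()
{% endhighlight %}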
@@ -1751,13 +1751,14 @@ GroupBy-Apply implements the "split-apply-combine" pattern. Split-apply-combine
input data contains all the rows and columns for each group.
* Combine the results into a new `DataFrame`.

- To use GroupBy-Apply, user needs to define:
- * A python function that defines the computation for each group
- * A `StructType` object or a string that defines the output schema of the output `DataFrame`
+ To use GroupBy-Apply, define the following:
+
+ * A Python function that defines the computation for each group.
+ * A `StructType` object or a string that defines the schema of the output `DataFrame`.

Examples:

- The first example shows a simple use case: subtracting mean from each value in the group.
+ The first example shows a simple use case: subtracting the mean from each value in the group.

<div class="codetabs">
<div data-lang="python" markdown="1">
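
The Python example itself is cut off by the hunk above; a minimal sketch of the "subtract the mean" use case, written against the grouped-map form of `pandas_udf` and assuming an existing `SparkSession` named `spark`, is roughly:

{% highlight python %}
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# The schema string below is the output-schema requirement listed above.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame containing all rows and columns of one group
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
{% endhighlight %}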
@@ -1864,15 +1865,14 @@ batches for processing.
Spark internally stores timestamps as UTC values, and timestamp data that is brought in without
a specified time zone is converted as local time to UTC with microsecond resolution. When timestamp
data is exported or displayed in Spark, the session time zone is used to localize the timestamp
- values. The session time zone is set with the conf 'spark.sql.session.timeZone' and will default
- to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
- resolution, `datetime64[ns]`, and optional time zone that can be applied on a per-column basis.
+ values. The session time zone is set with the configuration 'spark.sql.session.timeZone' and will
+ default to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
+ resolution, `datetime64[ns]`, with optional time zone on a per-column basis.

When timestamp data is transferred from Spark to Pandas it will be converted to nanoseconds
- and each column will be made time zone aware using the Spark session time zone. This will occur
- when calling `toPandas()` or `pandas_udf` with a timestamp column. For example if the session time
- zone is 'America/Los_Angeles' then the Pandas timestamp column will be of type
- `datetime64[ns, America/Los_Angeles]`.
+ and each column will be converted to the Spark session time zone then localized to that time
+ zone, which removes the time zone and displays values as local time. This will occur
+ when calling `toPandas()` or `pandas_udf` with timestamp columns.
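
A short sketch of the behaviour described above, showing the session time zone configuration and a timestamp column round-tripped through `toPandas()` (the column name and value are illustrative; an existing `SparkSession` named `spark` is assumed):

{% highlight python %}
import datetime
import pandas as pd

# Session time zone used to localize timestamps when data leaves Spark
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"ts": [datetime.datetime(2018, 1, 1, 12, 0, 0)]})
df = spark.createDataFrame(pdf)  # stored internally as UTC microseconds

# Per the text above, the column comes back as nanosecond-resolution
# datetime64[ns] values rendered in the session time zone.
result_pdf = df.toPandas()
print(result_pdf["ts"].dtype)
{% endhighlight %}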

When timestamp data is transferred from Pandas to Spark, it will be converted to UTC microseconds. This
occurs when calling `createDataFrame` with a Pandas DataFrame or when returning a timestamp from a