Commit 85e895c

added groupby-apply, fix timesamp section, address comments

1 parent 4872b63 commit 85e895c

File tree

1 file changed (+20 / -20 lines)

docs/sql-programming-guide.md

Lines changed: 20 additions & 20 deletions
@@ -1640,7 +1640,7 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
 You may run `./bin/spark-sql --help` for a complete list of all available
 options.

-# Usage Guide for Pandas with Arrow
+# PySpark Usage Guide for Pandas with Arrow

 ## Arrow in Spark

@@ -1651,19 +1651,19 @@ changes to configuration or code to take full advantage and ensure compatibility
 give a high-level description of how to use Arrow in Spark and highlight any differences when
 working with Arrow-enabled data.

-## Ensure pyarrow Installed
+### Ensure PyArrow Installed

-If you install pyspark using pip, then pyarrow can be brought in as an extra dependency of the sql
-module with the command "pip install pyspark[sql]". Otherwise, you must ensure that pyarrow is
-installed and available on all cluster node Python environments. The current supported version is
-0.8.0. You can install using pip or conda from the conda-forge channel. See pyarrow
+If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See PyArrow
 [installation](https://arrow.apache.org/docs/python/install.html) for details.

-## How to Enable for Conversion to/from Pandas
+## Enabling for Conversion to/from Pandas

 Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
 `toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
-To use Arrow when executing these calls, it first must be enabled by setting the Spark conf
+To use Arrow when executing these calls, it first must be enabled by setting the Spark configuration
 'spark.sql.execution.arrow.enabled' to 'true', this is disabled by default.

 <div class="codetabs">
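The installation note in this hunk can be sanity-checked from Python; a minimal sketch, assuming PyArrow is importable on the driver (the same import must also succeed in every worker's Python environment):

```python
# Minimal check that PyArrow is available and reports the supported version (0.8.0 per the text above).
import pyarrow

print(pyarrow.__version__)
```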
@@ -1683,7 +1683,7 @@ pdf = pd.DataFrame(np.random.rand(100, 3))
 df = spark.createDataFrame(pdf)

 # Convert the Spark DataFrame to a local Pandas DataFrame
-selpdf = df.select(" * ").toPandas()
+selpdf = df.select("*").toPandas()

 {% endhighlight %}
 </div>
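The `selpdf` line changed in this hunk is a fragment of a larger example in the guide; a minimal sketch of the surrounding steps, assuming an active `SparkSession` named `spark` and the configuration described in the previous hunk:

```python
import numpy as np
import pandas as pd

# Enable Arrow-based transfers (disabled by default).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a Pandas DataFrame and move it into Spark.
pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a local Pandas DataFrame using Arrow.
selpdf = df.select("*").toPandas()
```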
@@ -1751,13 +1751,14 @@ GroupBy-Apply implements the "split-apply-combine" pattern. Split-apply-combine
 input data contains all the rows and columns for each group.
 * Combine the results into a new `DataFrame`.

-To use GroupBy-Apply, user needs to define:
-* A python function that defines the computation for each group
-* A `StructType` object or a string that defines the output schema of the output `DataFrame`
+To use GroupBy-Apply, define the following:
+
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output `DataFrame`.

 Examples:

-The first example shows a simple use case: subtracting mean from each value in the group.
+The first example shows a simple use case: subtracting the mean from each value in the group.

 <div class="codetabs">
 <div data-lang="python" markdown="1">
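The two requirements listed in this hunk (a per-group Python function plus an output schema) map onto `pyspark.sql.functions.pandas_udf` with the grouped-map type; a minimal sketch of the subtract-the-mean example, assuming Spark 2.3 with Arrow enabled and an active `SparkSession` named `spark`:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# Output schema given as a DDL string; the function receives one group as a pandas.DataFrame.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```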
@@ -1864,15 +1865,14 @@ batches for processing.
 Spark internally stores timestamps as UTC values, and timestamp data that is brought in without
 a specified time zone is converted as local time to UTC with microsecond resolution. When timestamp
 data is exported or displayed in Spark, the session time zone is used to localize the timestamp
-values. The session time zone is set with the conf 'spark.sql.session.timeZone' and will default
-to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
-resolution, `datetime64[ns]`, and optional time zone that can be applied on a per-column basis.
+values. The session time zone is set with the configuration 'spark.sql.session.timeZone' and will
+default to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
+resolution, `datetime64[ns]`, with optional time zone on a per-column basis.

 When timestamp data is transferred from Spark to Pandas it will be converted to nanoseconds
-and each column will be made time zone aware using the Spark session time zone. This will occur
-when calling `toPandas()` or `pandas_udf` with a timestamp column. For example if the session time
-zone is 'America/Los_Angeles' then the Pandas timestamp column will be of type
-`datetime64[ns, America/Los_Angeles]`.
+and each column will be converted to the Spark session time zone then localized to that time
+zone, which removes the time zone and displays values as local time. This will occur
+when calling `toPandas()` or `pandas_udf` with timestamp columns.

 When timestamp data is transferred from Pandas to Spark, it will be converted to UTC microseconds. This
 occurs when calling `createDataFrame` with a Pandas DataFrame or when returning a timestamp from a
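A small sketch of the session-time-zone behaviour this hunk rewrites, assuming an active `SparkSession` named `spark`; the time zone value is only an illustration:

```python
import pandas as pd

# Session time zone used when localizing timestamps; defaults to the JVM local zone if unset.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Naive (tz-less) timestamps are treated as local time and converted to UTC, per the text above.
pdf = pd.DataFrame({"ts": pd.to_datetime(["2018-03-10 12:00:00"])})
df = spark.createDataFrame(pdf)

# With the behaviour described in this commit, toPandas() returns values as local time with the
# time zone removed, rather than tz-aware 'datetime64[ns, America/Los_Angeles]' values.
local_pdf = df.toPandas()
print(local_pdf["ts"].dtype)  # expected to be tz-naive after this change
```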
