
Commit d96d7b5

[DOC] [SQL] Adds Hive metastore Parquet table conversion section
This PR adds a section about Hive metastore Parquet table conversion. It documents:

1. Schema reconciliation rules introduced in apache#5214 (see [this comment] [1] in apache#5188)
2. Metadata refreshing requirement introduced in apache#5339

[1]: apache#5188 (comment)

Author: Cheng Lian <lian@databricks.com>

Closes apache#5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:

42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Addes Hive metastore Parquet table conversion section
1 parent a803118


docs/sql-programming-guide.md

Lines changed: 88 additions & 6 deletions
@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
 All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
 
 
-## Starting Point: `SQLContext`
+## Starting Point: SQLContext
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -1036,6 +1036,15 @@ for (teenName in collect(teenNames)) {
 
 </div>
 
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.sql("REFRESH TABLE my_table")
+{% endhighlight %}
+
+</div>
+
 <div data-lang="sql" markdown="1">
 
 {% highlight sql %}
@@ -1054,7 +1063,7 @@ SELECT * FROM parquetTable
 
 </div>
 
-### Partition discovery
+### Partition Discovery
 
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
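For readers skimming this hunk: the section it retitles describes partition discovery, where partition column values are encoded in directory names. A minimal Scala sketch under invented paths (not part of this commit's diff):

{% highlight scala %}
// A partitioned layout encodes column values in directory names, e.g.:
//
//   /data/events/year=2015/month=04/part-00000.parquet
//   /data/events/year=2015/month=05/part-00000.parquet
//
// Reading the root directory lets Spark SQL discover `year` and `month`
// as partition columns automatically.
val df = sqlContext.read.parquet("/data/events")
df.printSchema()  // schema includes the discovered partition columns
{% endhighlight %}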
@@ -1108,7 +1117,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
 `true`. When type inference is disabled, string type will be used for the partitioning columns.
 
 
-### Schema merging
+### Schema Merging
 
 Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
 a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
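The retitled section covers Parquet schema merging. As a hedged illustration of the feature it documents (paths and column names are hypothetical; the explicit `mergeSchema` read option is an assumption for version safety):

{% highlight scala %}
// sqlContext is an existing SQLContext
import sqlContext.implicits._

// Write two Parquet files whose schemas differ by one column.
Seq((1, 2), (2, 4)).toDF("single", "double").write.parquet("data/test_table/key=1")
Seq((3, 9), (4, 16)).toDF("single", "triple").write.parquet("data/test_table/key=2")

// Reading the parent directory with schema merging enabled reconciles the two
// file schemas; the result has `single`, `double`, `triple`, and the partition
// column `key`.
val df = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df.printSchema()
{% endhighlight %}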
@@ -1208,6 +1217,79 @@ printSchema(df3)
 
 </div>
 
+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
+
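For context on the flag introduced above, a minimal sketch of disabling the conversion for a session (assuming an existing `HiveContext`, as in the snippets elsewhere in this diff; not part of the commit):

{% highlight scala %}
// sqlContext is an existing HiveContext
// Fall back to Hive SerDe instead of Spark SQL's built-in Parquet support:
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
{% endhighlight %}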
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of table schema
+processing:
+
+1. Hive is case insensitive, while Parquet is not.
+1. Hive considers all columns nullable, while nullability in Parquet is significant.
+
+For this reason, we must reconcile the Hive metastore schema with the Parquet schema when
+converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation
+rules are:
+
+1. Fields that have the same name in both schemas must have the same data type regardless of
+   nullability. The reconciled field should have the data type of the Parquet side, so that
+   nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
+
+   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
+   - Any fields that only appear in the Hive metastore schema are added as nullable fields in
+     the reconciled schema.
+
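To make the reconciliation rules concrete, a worked example with invented schemas (illustration only, not from this commit):

{% highlight scala %}
import org.apache.spark.sql.types._

// Hive metastore schema: names matched case-insensitively, all columns nullable.
val hiveSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("extra", StringType, nullable = true)))  // appears only in Hive

// Parquet schema: case matters, nullability is significant.
val parquetSchema = StructType(Seq(
  StructField("ID", LongType, nullable = false),       // same field as Hive's `id`
  StructField("name", StringType, nullable = true),
  StructField("tmp", IntegerType, nullable = true)))   // appears only in Parquet

// Applying the rules above yields exactly the Hive fields, taking type and
// nullability from the Parquet side where a field matches, and keeping
// Hive-only fields as nullable:
val reconciled = StructType(Seq(
  StructField("id", LongType, nullable = false),       // Parquet nullability wins
  StructField("name", StringType, nullable = true),
  StructField("extra", StringType, nullable = true)))  // Hive-only, stays nullable
// `tmp` is dropped because it is absent from the Hive metastore schema.
{% endhighlight %}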
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
+conversion is enabled, metadata of those converted tables is also cached. If these tables are
+updated by Hive or other external tools, you need to refresh them manually to ensure consistent
+metadata.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+REFRESH TABLE my_table;
+{% endhighlight %}
+
+</div>
+
+</div>
+
 ### Configuration
 
 Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running
@@ -1445,8 +1527,8 @@ This command builds a new assembly jar that includes Hive. Note that this Hive a
 on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
 (SerDes) in order to access data stored in Hive.
 
-Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
-the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
+Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
+the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
 and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the
 YARN cluster. The convenient way to do this is adding them through the `--jars` option and `--files` option of the
 `spark-submit` command.
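A hedged sketch of that `--jars`/`--files` workaround (the class name, jar list, and application jar are placeholders, not from this commit):

{% highlight bash %}
# Hypothetical invocation; match the jar names to those under lib_managed/jars.
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --jars lib_managed/jars/datanucleus-api-jdo.jar,lib_managed/jars/datanucleus-core.jar,lib_managed/jars/datanucleus-rdbms.jar \
  --files conf/hive-site.xml \
  my-app.jar
{% endhighlight %}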
@@ -1889,7 +1971,7 @@ options.
 #### DataFrame data reader/writer interface
 
 Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
-and writing data out (`DataFrame.write`),
+and writing data out (`DataFrame.write`),
 and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).
 
 See the API docs for `SQLContext.read` (
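As a before/after sketch of the migration this hunk describes (paths invented; the old calls are the deprecated APIs named above):

{% highlight scala %}
// Deprecated 1.3-style API:
val oldDf = sqlContext.parquetFile("data/people.parquet")
oldDf.saveAsParquetFile("data/people_copy.parquet")

// Equivalent calls with the new reader/writer interface:
val df = sqlContext.read.parquet("data/people.parquet")
df.write.parquet("data/people_copy.parquet")
{% endhighlight %}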
