
Commit 50675db

Adds Hive metastore Parquet table conversion section

1 parent 6f4cadf commit 50675db

File tree

1 file changed: +67 -3 lines changed


docs/sql-programming-guide.md

Lines changed: 67 additions & 3 deletions
@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
 All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
 
 
-## Starting Point: `SQLContext`
+## Starting Point: SQLContext
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -1054,7 +1054,7 @@ SELECT * FROM parquetTable
 
 </div>
 
-### Partition discovery
+### Partition Discovery
 
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
@@ -1108,7 +1108,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
 `true`. When type inference is disabled, string type will be used for the partitioning columns.
 
 
-### Schema merging
+### Schema Merging
 
 Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
 a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
@@ -1208,6 +1208,70 @@ printSchema(df3)
 
 </div>
 
+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
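Editor's note: as a hedged illustration of the flag described above (not part of the original diff, and assuming an existing `sqlContext` created as a `HiveContext`), the conversion can be disabled at runtime so that Hive SerDe is used instead:

```python
# Illustrative sketch only; assumes `sqlContext` is an existing HiveContext.
# Disabling the conversion makes Spark SQL fall back to Hive SerDe for these tables.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
```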
1216+
1217+
#### Hive/Parquet Schema Reconciliation
1218+
1219+
There are two key differences between Hive and Parquet from the perspective of table schema
1220+
processing.
1221+
1222+
1. Hive is case insensitive, while Parquet is not
1223+
1. Hive considers all columns nullable, while nullability in Parquet is significant
1224+
1225+
Due to this reason, we must reconcile Hive metastore schema with Parquet schema when converting a
1226+
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
1227+
1228+
1. Fields that have the same name in both schema must have the same data type regardless of
1229+
nullability. The reconciled field should have the data type of the Parquet side, so that
1230+
nullability is respected.
1231+
1232+
1. The reconciled schema contains exactly those fields defined in Hive metastore schema.
1233+
1234+
- Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
1235+
- Any fileds that only appear in the Hive metastore schema are added as nullable field in the
1236+
reconciled schema.
1237+
1238+
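Editor's note: the reconciliation rules above can be sketched as a small, self-contained Python function. This is a hypothetical model for illustration, not Spark's actual implementation; each schema is represented as a dict mapping column name to a `(data_type, nullable)` pair:

```python
def reconcile(hive_schema, parquet_schema):
    """Sketch of the Hive/Parquet schema reconciliation rules (illustrative only)."""
    # Hive is case insensitive, Parquet is not -- match column names case-insensitively.
    parquet_by_lower = {name.lower(): field for name, field in parquet_schema.items()}
    reconciled = {}
    # The reconciled schema contains exactly the fields defined in the Hive metastore schema.
    for name, (hive_type, _hive_nullable) in hive_schema.items():
        match = parquet_by_lower.get(name.lower())
        if match is not None:
            parquet_type, parquet_nullable = match
            # Same-named fields must share a data type, nullability aside ...
            if parquet_type != hive_type:
                raise ValueError("type conflict for column %r" % name)
            # ... and the reconciled field takes the Parquet side, so nullability is respected.
            reconciled[name] = (parquet_type, parquet_nullable)
        else:
            # Fields only in the Hive metastore schema are added as nullable fields.
            reconciled[name] = (hive_type, True)
    # Fields only in the Parquet schema are dropped (never copied into `reconciled`).
    return reconciled
```

For example, a Hive column `id` matches a Parquet column `ID` despite the case difference, and the reconciled field keeps Parquet's nullability.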
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
+conversion is enabled, metadata of those converted tables is also cached. If these tables are
+updated by Hive or other external tools, you need to refresh them manually to ensure consistent
+metadata.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+REFRESH TABLE my_table;
+{% endhighlight %}
+
+</div>
+
+</div>
+
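Editor's note: the same refresh can also be issued from PySpark, which the diff does not show; a hedged sketch, assuming `sqlContext` is an existing `HiveContext`:

```python
# Illustrative sketch only; assumes `sqlContext` is an existing HiveContext.
sqlContext.refreshTable("my_table")
```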
 ### Configuration
 
 Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running

0 commit comments
