docs/sql-programming-guide.md (67 additions, 3 deletions)
@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.

-## Starting Point: `SQLContext`
+## Starting Point: SQLContext

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -1054,7 +1054,7 @@ SELECT * FROM parquetTable
</div>

-### Partition discovery
+### Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
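To make the truncated paragraph concrete, here is a minimal sketch of partition discovery, assuming a hypothetical table rooted at `path/to/table` and partitioned by `gender` and `country`:

{% highlight scala %}
// Hypothetical on-disk layout, with partition column values encoded in paths:
//   path/to/table/gender=male/country=US/data.parquet
//   path/to/table/gender=female/country=CN/data.parquet

// Pointing the Parquet reader at the table root lets Spark SQL discover the
// partition columns and append them to the schema read from the files.
val df = sqlContext.read.parquet("path/to/table")
df.printSchema()
// root
//  |-- ...data columns from the Parquet files...
//  |-- gender: string (nullable = true)
//  |-- country: string (nullable = true)
{% endhighlight %}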
@@ -1108,7 +1108,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
`true`. When type inference is disabled, string type will be used for the partitioning columns.

-### Schema merging
+### Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
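The section's full example is cut off by the hunk boundary (its tail, `printSchema(df3)`, is visible in the next hunk header); a condensed sketch of the idea, with invented column names and paths:

{% highlight scala %}
import sqlContext.implicits._

// Write two Parquet files whose schemas differ but are mutually compatible.
val df1 = sc.parallelize(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

val df2 = sc.parallelize(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// Reading the table root with schema merging enabled yields the union of the
// two schemas, plus the partition column `key`. Note that in recent Spark
// versions merging is off by default and must be requested explicitly.
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()
{% endhighlight %}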
@@ -1208,6 +1208,70 @@ printSchema(df3)
</div>

+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
+
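As an aside for readers of this hunk (not part of the patch): the flag can be flipped per session with `setConf`, following the guide's convention that `sqlContext` is an existing `HiveContext`:

{% highlight scala %}
// Fall back to the Hive SerDe instead of Spark SQL's built-in Parquet support.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
{% endhighlight %}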
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of table schema
+processing.
+
+1. Hive is case insensitive, while Parquet is not
+1. Hive considers all columns nullable, while nullability in Parquet is significant
+
+For this reason, we must reconcile the Hive metastore schema with the Parquet schema when
+converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation
+rules are:
+
+1. Fields that have the same name in both schemas must have the same data type regardless of
+   nullability. The reconciled field should have the data type of the Parquet side, so that
+   nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
+
+   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
+   - Any fields that only appear in the Hive metastore schema are added as nullable fields in
+     the reconciled schema.
+
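A worked (hypothetical) reconciliation may help; the schemas below are invented for illustration:

{% highlight scala %}
// Hive metastore schema (names lower-cased, every column nullable):
//   key: int, value: string
// Parquet file schema:
//   Key: int (non-nullable), value: string, extra: long
//
// Reconciled schema Spark SQL would use:
//   key: int (non-nullable)  -- name matched case-insensitively; the Parquet
//                               side's type and nullability win (rule 1)
//   value: string            -- present in both schemas
//   -- `extra` is dropped: it does not appear in the Hive metastore schema
//      (rule 2); a metastore-only column would instead be added as nullable.
{% endhighlight %}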
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
+conversion is enabled, metadata of those converted tables is also cached. If these tables are
+updated by Hive or other external tools, you need to refresh them manually to ensure consistent
+metadata.
+
<divclass="codetabs">
1246
+
1247
+
<divdata-lang="scala"markdown="1">
1248
+
1249
+
{% highlight scala %}
1250
+
// sqlContext is an existing HiveContext
1251
+
sqlContext.refreshTable("my_table")
1252
+
{% endhighlight %}
1253
+
1254
+
</div>
1255
+
1256
+
<divdata-lang="java"markdown="1">
1257
+
1258
+
{% highlight java %}
1259
+
// sqlContext is an existing HiveContext
1260
+
sqlContext.refreshTable("my_table")
1261
+
{% endhighlight %}
1262
+
1263
+
</div>
1264
+
1265
+
<divdata-lang="sql"markdown="1">
1266
+
1267
+
{% highlight sql %}
1268
+
REFRESH TABLE my_table;
1269
+
{% endhighlight %}
1270
+
1271
+
</div>
1272
+
1273
+
</div>
1274
+
### Configuration

Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running
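The sentence is truncated by the end of the hunk; for completeness, both routes it refers to look roughly like this (the key shown, `spark.sql.parquet.binaryAsString`, is one of the Parquet options from the guide's configuration table):

{% highlight scala %}
// Programmatically, via setConf on the SQLContext...
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
// ...or by running a SET command in SQL.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
{% endhighlight %}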