docs/sql-programming-guide.md (88 additions, 6 deletions)
@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
 All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.


-## Starting Point: `SQLContext`
+## Starting Point: SQLContext

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -1036,6 +1036,15 @@ for (teenName in collect(teenNames)) {

 </div>

+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.sql("REFRESH TABLE my_table")
+{% endhighlight %}
+
+</div>
+
 <div data-lang="sql" markdown="1">

 {% highlight sql %}
@@ -1054,7 +1063,7 @@ SELECT * FROM parquetTable

 </div>

-### Partition discovery
+### Partition Discovery

 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
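To make that layout concrete (the path and the partitioning columns below are hypothetical), a Parquet table stored under `key=value` directories can be loaded by pointing the reader at the table root; Spark SQL discovers the partitioning columns from the directory names:

{% highlight scala %}
// Hypothetical layout on disk:
//   /data/table/gender=male/country=US/part-00000.parquet
//   /data/table/gender=female/country=CN/part-00000.parquet

// sqlContext is an existing SQLContext; reading the table root triggers partition discovery.
val df = sqlContext.read.parquet("/data/table")
df.printSchema()  // the schema now includes gender and country in addition to the data columns
{% endhighlight %}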
@@ -1108,7 +1117,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
 `true`. When type inference is disabled, string type will be used for the partitioning columns.


-### Schema merging
+### Schema Merging

 Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
 a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
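A minimal sketch of what schema merging looks like at read time (assuming several Parquet files with different but compatible schemas were written under one directory; the `mergeSchema` read option and the path here are illustrative):

{% highlight scala %}
// sqlContext is an existing SQLContext.
// Ask the Parquet data source to merge the schemas of all the files it finds.
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/data/evolving_table")
merged.printSchema()  // the union of the columns seen across the files
{% endhighlight %}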
@@ -1208,6 +1217,79 @@ printSchema(df3)

 </div>

+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
+
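For example (an illustrative sketch), the conversion can be switched off for a session so that the Hive SerDe is used instead:

{% highlight scala %}
// sqlContext is an existing HiveContext.
// Fall back to the Hive SerDe for metastore Parquet tables in this session.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
{% endhighlight %}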
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of table schema
+processing.
+
+1. Hive is case insensitive, while Parquet is not
+1. Hive considers all columns nullable, while nullability in Parquet is significant
+
+For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
+Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules, illustrated in the
+sketch below, are:
+
+1. Fields that have the same name in both schemas must have the same data type regardless of
+   nullability. The reconciled field should have the data type of the Parquet side, so that
+   nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
+
+   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
+   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
+     reconciled schema.
+
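A worked example of these rules (the schemas below are hypothetical, written as Spark SQL `StructType`s purely for illustration):

{% highlight scala %}
import org.apache.spark.sql.types._

// Hypothetical Hive metastore schema: Hive treats every column as nullable.
val hiveSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("extra", StringType, nullable = true)))    // appears only in the metastore

// Hypothetical Parquet file schema: nullability is recorded precisely.
val parquetSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true),
  StructField("ignored", LongType, nullable = true)))    // appears only in the Parquet files

// Applying the rules above, the reconciled schema would be:
//   key   (IntegerType, nullable = false)  -- shared field, Parquet type and nullability win
//   value (StringType,  nullable = true)   -- shared field
//   extra (StringType,  nullable = true)   -- metastore-only field, added back as nullable
// "ignored" is dropped because it appears only on the Parquet side.
{% endhighlight %}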
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
+conversion is enabled, metadata of those converted tables is also cached. If these tables are
+updated by Hive or other external tools, you need to refresh them manually to ensure consistent
+metadata.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+REFRESH TABLE my_table;
+{% endhighlight %}
+
+</div>
+
+</div>
+
 ### Configuration

 Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running
@@ -1445,8 +1527,8 @@ This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present
 on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
 (SerDes) in order to access data stored in Hive.

-Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
-the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
+Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
+the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
 and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the
 YARN cluster. The convenient way to do this is adding them through the `--jars` option and `--files` option of the
 `spark-submit` command.
@@ -1889,7 +1971,7 @@ options.
 #### DataFrame data reader/writer interface

 Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
-and writing data out (`DataFrame.write`),
+and writing data out (`DataFrame.write`),
 and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).
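A small before-and-after sketch of that migration (the file paths are illustrative):

{% highlight scala %}
// sqlContext is an existing SQLContext.

// Deprecated style:
val dfOld = sqlContext.parquetFile("examples/src/main/resources/users.parquet")

// New reader/writer interface:
val df = sqlContext.read.parquet("examples/src/main/resources/users.parquet")
df.write.json("users.json")
{% endhighlight %}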