Commit 7eb6955

Docs for the new Parquet data source
1 parent 415eefb commit 7eb6955

File tree

1 file changed: +154 -31 lines changed

docs/sql-programming-guide.md

Lines changed: 154 additions & 31 deletions
@@ -100,7 +100,7 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
-df.show()
+df.show()
{% endhighlight %}

</div>
@@ -151,10 +151,10 @@ val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// Show the content of the DataFrame
df.show()
-// age name
+// age name
// null Michael
-// 30 Andy
-// 19 Justin
+// 30 Andy
+// 19 Justin

// Print the schema in a tree format
df.printSchema()
@@ -164,17 +164,17 @@ df.printSchema()

// Select only the "name" column
df.select("name").show()
-// name
+// name
// Michael
-// Andy
-// Justin
+// Andy
+// Justin

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name (age + 1)
-// Michael null
-// Andy 31
-// Justin 20
+// Michael null
+// Andy 31
+// Justin 20

// Select people older than 21
df.filter(df("age") > 21).show()
@@ -201,10 +201,10 @@ DataFrame df = sqlContext.jsonFile("examples/src/main/resources/people.json");

// Show the content of the DataFrame
df.show();
-// age name
+// age name
// null Michael
-// 30 Andy
-// 19 Justin
+// 30 Andy
+// 19 Justin

// Print the schema in a tree format
df.printSchema();
@@ -214,17 +214,17 @@ df.printSchema();

// Select only the "name" column
df.select("name").show();
-// name
+// name
// Michael
-// Andy
-// Justin
+// Andy
+// Justin

// Select everybody, but increment the age by 1
df.select("name", df.col("age").plus(1)).show();
// name (age + 1)
-// Michael null
-// Andy 31
-// Justin 20
+// Michael null
+// Andy 31
+// Justin 20

// Select people older than 21
df.filter(df.col("age").gt(21)).show();
@@ -251,10 +251,10 @@ df = sqlContext.jsonFile("examples/src/main/resources/people.json")

# Show the content of the DataFrame
df.show()
-## age name
+## age name
## null Michael
-## 30 Andy
-## 19 Justin
+## 30 Andy
+## 19 Justin

# Print the schema in a tree format
df.printSchema()
@@ -264,17 +264,17 @@ df.printSchema()

# Select only the "name" column
df.select("name").show()
-## name
+## name
## Michael
-## Andy
-## Justin
+## Andy
+## Justin

# Select everybody, but increment the age by 1
df.select("name", df.age + 1).show()
## name (age + 1)
-## Michael null
-## Andy 31
-## Justin 20
+## Michael null
+## Andy 31
+## Justin 20

# Select people older than 21
df.filter(df.age > 21).show()
@@ -907,6 +907,129 @@ SELECT * FROM parquetTable

</div>

### Partition discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. The Parquet data source is now able to discover and infer
partitioning information automatically. For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, `sex` and `country`, as partitioning columns:

{% highlight text %}

path
└── to
    └── table
        ├── sex=0
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── sex=1
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

{% endhighlight %}

By passing `path/to/table` to either `SQLContext.parquetFile` or `SQLContext.load`, Spark SQL will
automatically extract the partitioning information from the paths. Now the schema of the returned
DataFrame becomes:

{% highlight text %}

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- sex: string (nullable = true)
|-- country: string (nullable = true)

{% endhighlight %}

Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types and string type are supported.
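
For example, a minimal sketch of reading such a partitioned table (assuming the directory layout
above is stored under `path/to/table`, and reusing the `sqlContext` from the earlier examples):

{% highlight scala %}
// Discover partitions while loading the table;
// sqlContext.load("path/to/table", "parquet") behaves the same way
val people = sqlContext.parquetFile("path/to/table")

// The discovered partitioning columns `sex` and `country` behave like
// ordinary columns and can be used in projections and filters
people.filter(people("country") === "US").select("name", "age").show()
{% endhighlight %}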

### Schema merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source is now able to automatically detect this case and merge schemas of all these files.

<div class="codetabs">

<div data-lang="scala" markdown="1">

{% highlight scala %}
// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory
val df1 = sparkContext.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.saveAsParquetFile("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = sparkContext.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.saveAsParquetFile("data/test_table/key=2")

// Read the partitioned table
val df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column that appears in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key: int (nullable = true)
{% endhighlight %}

</div>

<div data-lang="python" markdown="1">

{% highlight python %}
# sqlContext from the previous example is used in this example.
from pyspark.sql import Row

# Create a simple DataFrame, stored into a partition directory
df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))
                                   .map(lambda i: Row(single=i, double=i * 2)))
df1.save("data/test_table/key=1", "parquet")

# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
                                   .map(lambda i: Row(single=i, triple=i * 3)))
df2.save("data/test_table/key=2", "parquet")

# Read the partitioned table
df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()

# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column that appears in the partition directory paths.
# root
# |-- single: int (nullable = true)
# |-- double: int (nullable = true)
# |-- triple: int (nullable = true)
# |-- key: int (nullable = true)
{% endhighlight %}

</div>

</div>

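Continuing the Scala example above, a brief usage sketch: after the merge, columns that are missing
from a particular Parquet file simply read back as null, and the partitioning column `key` can be
used for filtering like any other column:

{% highlight scala %}
// `triple` is null for the rows stored under key=1, and `double` is null
// for the rows stored under key=2
df3.select("single", "double", "triple", "key").show()

// Filtering on the partitioning column `key`
df3.filter(df3("key") === 2).show()
{% endhighlight %}
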
### Configuration

Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running
@@ -1429,10 +1552,10 @@ Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.

You may also use the beeline script that comes with Hive.

-Thrift JDBC server also supports sending thrift RPC messages over HTTP transport.
-Use the following setting to enable HTTP mode as system property or in `hive-site.xml` file in `conf/`:
+Thrift JDBC server also supports sending thrift RPC messages over HTTP transport.
+Use the following setting to enable HTTP mode as system property or in `hive-site.xml` file in `conf/`:

-hive.server2.transport.mode - Set this to value: http
+hive.server2.transport.mode - Set this to value: http
hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
hive.server2.http.endpoint - HTTP endpoint; default is cliservice
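
For example, the corresponding entries in `hive-site.xml` might look like the following (a sketch
filled in with the default values listed above):

{% highlight xml %}
<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>10001</value>
</property>
<property>
  <name>hive.server2.http.endpoint</name>
  <value>cliservice</value>
</property>
{% endhighlight %}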