---
layout: global
title: Spark SQL Programming Guide
---
**Spark SQL is currently an Alpha component. Therefore, the APIs may be changed in future releases.**

* This will become a table of contents (this text will be scraped).
{:toc}

# Overview
Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
Spark. At the core of this component is a new type of RDD,
[SchemaRDD](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed of
[Row](api/sql/catalyst/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects, along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, a Parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

**All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell.**

***************************************************************************************************

# Getting Started

The entry point into all relational functionality in Spark is the
[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its
descendants. To create a basic SQLContext, all you need is a SparkContext.

{% highlight scala %}
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import sqlContext._
{% endhighlight %}

## Running SQL on RDDs
One type of table that is supported by Spark SQL is an RDD of Scala case classes. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Sequences or Arrays. This RDD can be implicitly converted to a SchemaRDD and then be
registered as a table. Tables can be used in subsequent SQL statements.

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}
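
As noted above, case classes can also be nested or contain complex types such as Sequences. A
hypothetical sketch (these classes and this data are not part of the bundled examples):

{% highlight scala %}
// Nested case classes and Seq-valued fields are reflected into the schema as well.
case class Address(city: String, zipCode: String)
case class Contact(name: String, age: Int, address: Address, phoneNumbers: Seq[String])

val contacts = sc.parallelize(Seq(
  Contact("Alice", 17, Address("Berkeley", "94704"), Seq("555-0100"))))
contacts.registerAsTable("contacts")

// Top-level columns of the nested schema can be queried as before.
sql("SELECT name FROM contacts WHERE age <= 19").collect().foreach(println)
{% endhighlight %}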

**Note that Spark SQL currently uses a very basic SQL parser, and the keywords are case sensitive.**
Users who want a more complete dialect of SQL should look at the HiveQL support provided by
`HiveContext`.

## Using Parquet

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL
provides support for both reading and writing Parquet files, automatically preserving the schema
of the original data. Using the data from the above example:

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val people: RDD[Person] // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerAsTable("parquetFile")
val teenagers = sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)
{% endhighlight %}
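
Because the result of a SQL query is itself a SchemaRDD, it can also be written back out as a
Parquet file. A short sketch (the output path below is hypothetical):

{% highlight scala %}
// Save the rows returned by the query above to a new Parquet file.
teenagers.saveAsParquetFile("teenagers.parquet")
{% endhighlight %}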

## Writing Language-Integrated Relational Queries

Spark SQL also supports a domain-specific language for writing queries. Once again,
using the data from the above examples:

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val people: RDD[Person] // An RDD of case class objects, from the first example.

// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
{% endhighlight %}

The DSL uses Scala symbols to represent columns in the underlying table; these are identifiers
prefixed with a tick (`'`). Implicit conversions turn these symbols into expressions that are
evaluated by the SQL execution engine. A full list of the functions supported can be found in the
[ScalaDoc](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD).
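
Like SQL query results, the results of DSL queries are SchemaRDDs, so they support the usual RDD
operations and can themselves be registered as tables. A brief sketch continuing the example above
(the table name is arbitrary):

{% highlight scala %}
// The DSL result behaves like any other SchemaRDD.
teenagers.registerAsTable("teenagers")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}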

<!-- TODO: Include the table of operations here. -->

# Hive Support

Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
In order to use Hive you must first run '`sbt/sbt hive/assembly`'. This command builds a new assembly
jar that includes Hive. When this jar is present, Spark will use the Hive
assembly instead of the normal Spark assembly. Note that this Hive assembly jar must also be present
on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
(SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.

When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext` and
adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do
not have an existing Hive deployment can also experiment with the `LocalHiveContext`,
which is similar to `HiveContext`, but creates a local copy of the `metastore` and `warehouse`
automatically.

{% highlight scala %}
val sc: SparkContext // An existing SparkContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import hiveContext._

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
{% endhighlight %}
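
For experimentation without an existing Hive deployment, a `LocalHiveContext` can be used in place
of the `HiveContext` above; as described earlier, it creates the metastore and warehouse
automatically. A minimal sketch:

{% highlight scala %}
// Creates a local copy of the metastore and warehouse for testing purposes.
val localHiveContext = new org.apache.spark.sql.hive.LocalHiveContext(sc)
{% endhighlight %}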