A library for parsing and querying CSV data with Spark SQL.
This library requires Spark 1.2+
You can link against this library in your program at the following coordinates:
groupId: com.databricks.spark
artifactId: spark-csv_2.10
version: 0.1
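For example, with SBT the dependency can be declared as follows (a sketch assuming the artifact is available in a repository your build resolves):

libraryDependencies += "com.databricks.spark" % "spark-csv_2.10" % "0.1"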
The spark-csv assembly jar file can also be added to a Spark application using the --jars
command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --jars spark-csv-assembly-0.1.jar
These examples use a CSV file available for download here:
$ wget https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv
You can use the library by importing the implicits from com.databricks.spark.csv._:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
// Read the CSV file into a SchemaRDD.
val cars = sqlContext.csvFile("cars.csv")
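The returned SchemaRDD behaves like any other. As a minimal sketch, it can be registered as a temporary table (here named cars, an arbitrary choice) and queried with SQL:

// Register the SchemaRDD so it can be queried with SQL.
cars.registerTempTable("cars")
sqlContext.sql("SELECT * FROM cars").collect().foreach(println)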
CSV data can be queried in pure SQL by registering the data as a temporary table.
CREATE TEMPORARY TABLE cars
USING com.databricks.spark.csv
OPTIONS (path "cars.csv", header "true")
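The registered table can then be queried like any other table, for example:

SELECT * FROM cars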
From Java, CSV files can be read using the CsvUtils class.

import com.databricks.spark.csv.CsvUtils;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

// Read the CSV file, using the header row for column names.
JavaSchemaRDD cars = (new CsvUtils()).setUseHeader(true).csvFile(sqlContext, "cars.csv");
This library is built with SBT, which is automatically downloaded by the included shell script. To build an assembly JAR file, run the following from the project root:

$ sbt/sbt assembly