Utilities and examples to assist in working with Cassandra and PySpark.
Currently contains an updated and much more robust example of using a
SparkContext's `newAPIHadoopRDD` to read from, and an RDD's
`saveAsNewAPIHadoopDataset` to write to, Cassandra 2.1. Demonstrates usage
of CQL collections: lists, sets, and maps.
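
For a feel for the read path, here is a minimal sketch assuming a local Cassandra 2.1 node, a `test.users` table, and CQL converter classes bundled in the uberjar. The converter class names and several configuration values below are illustrative assumptions, not this project's fixed API:

```python
# Minimal read sketch (Cassandra 2.1 Hadoop input support); the converter
# classes are hypothetical stand-ins for the ones shipped in the uberjar.
from pyspark import SparkContext

sc = SparkContext(appName="cassandra-read-example")

conf = {
    "cassandra.input.thrift.address": "localhost",
    "cassandra.input.thrift.port": "9160",
    "cassandra.input.keyspace": "test",
    "cassandra.input.columnfamily": "users",
    "cassandra.input.partitioner.class": "Murmur3Partitioner",
}

users = sc.newAPIHadoopRDD(
    "org.apache.cassandra.hadoop.cql3.CqlInputFormat",
    "java.util.Map",  # key class
    "java.util.Map",  # value class
    keyConverter="com.example.converters.UsersCQLKeyConverter",      # hypothetical
    valueConverter="com.example.converters.UsersCQLValueConverter",  # hypothetical
    conf=conf)

# After conversion, CQL lists, sets, and maps surface as Python lists,
# sets, and dicts respectively.
print(users.collect())
```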
Working on proper integration with the DataStax Cassandra Spark Connector.
You'll need Maven in order to build the uberjar required for the examples:

```
mvn clean package
```

This will create an uberjar at `target/pyspark-cassandra-<version>-SNAPSHOT.jar`.
Scripts need this uberjar on the driver classpath when submitted, e.g.:

```
spark-submit --driver-class-path /path/to/pyspark-cassandra.jar myscript.py ...
```

Install the Python dependencies:

```
pip install -r requirements.txt
```

Then run examples either directly with `spark-submit` as above, or use the
`run_script.py` utility.
The example can first create the schema it requires via:

```
./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py init test
```

The `init` command creates the keyspace and table and inserts sample data.
`"test"` is the name of the keyspace; a `users` table will be created in
this keyspace with two sample users to enable reading.
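
As a rough illustration of what the `init` step amounts to (the actual schema and data live in the example script; the column names and types below are assumptions for illustration only), the equivalent CQL issued through the DataStax Python driver might look like:

```python
# Hypothetical sketch of an init step: the real schema/data live in
# pyspark_cassandra_hadoop_example.py; names below are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(["localhost"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS test WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute(
    "CREATE TABLE IF NOT EXISTS test.users ("
    "  user_id text PRIMARY KEY,"
    "  emails set<text>,"             # CQL set
    "  favorites list<text>,"         # CQL list
    "  attributes map<text, text>)")  # CQL map
session.execute(
    "INSERT INTO test.users (user_id, emails, favorites, attributes) VALUES "
    "('alice', {'alice@example.com'}, ['spark'], {'role': 'admin'})")
```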
Afterwards, you can run:

```
./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test
```

This runs a sample PySpark driver program that reads the existing values in
the `users` table and then writes two new users to the table.
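
For reference, the write half of such a driver uses `saveAsNewAPIHadoopDataset` with Cassandra's `CqlOutputFormat`. The sketch below is an assumption-laden outline (hypothetical converter classes, illustrative column names and CQL), not the example script's exact code:

```python
# Hypothetical write sketch: column names, CQL, and the converter classes
# are illustrative. CqlOutputFormat binds each value in the tuple to the
# "?" markers of the UPDATE statement (Cassandra 2.1 Hadoop output support).
from pyspark import SparkContext

sc = SparkContext(appName="cassandra-write-example")

conf = {
    "cassandra.output.thrift.address": "localhost",
    "cassandra.output.thrift.port": "9160",
    "cassandra.output.keyspace": "test",
    "cassandra.output.partitioner.class": "Murmur3Partitioner",
    "cassandra.output.cql":
        "UPDATE test.users SET emails = ?, favorites = ?, attributes = ?",
    "mapreduce.output.basename": "users",
    "mapreduce.outputformat.class":
        "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
    "mapreduce.job.output.key.class": "java.util.Map",
    "mapreduce.job.output.value.class": "java.util.List",
}

new_users = sc.parallelize([
    # (key columns, bound values): Python sets/lists/dicts map onto the
    # corresponding CQL collection columns.
    ({"user_id": "carol"},
     [{"carol@example.com"}, ["cassandra"], {"role": "user"}]),
])

new_users.saveAsNewAPIHadoopDataset(
    conf=conf,
    keyConverter="com.example.converters.UsersCQLKeyConverter",      # hypothetical
    valueConverter="com.example.converters.UsersCQLValueConverter")  # hypothetical
```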