Pig & Cassandra

Outline for CassandraDC Meetup Talk, 07/06/2011

What's Pig?

 -- Counts all the rows in a column family  
 rows = LOAD 'cassandra://ks/cf' USING CassandraStorage();  
 countthis = GROUP rows ALL;  
 countedrows = FOREACH countthis GENERATE COUNT(rows.$0);  
 dump countedrows;

Hive:SQL :: Pig:SQL execution plan
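One way to read the analogy: in SQL you state the result you want and the engine chooses the steps, while in Pig you write the steps yourself. A rough sketch follows, assuming CassandraStorage hands back each row as a key plus a bag of (name, value) columns. `ks` and `cf` are placeholder names, and the SQL in the first comment is only an approximation of the intent.

```pig
-- SQL-ish intent: SELECT name, COUNT(*) FROM all_columns GROUP BY name
-- Step 1: scan every row of the column family (schema is an assumption)
rows = LOAD 'cassandra://ks/cf' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
-- Step 2: turn each row's bag of columns into one tuple per column
cols = FOREACH rows GENERATE FLATTEN(columns);
-- Step 3: keep only the column name
colnames = FOREACH cols GENERATE $0;
-- Step 4: group identical names together
namegroups = GROUP colnames BY (chararray) $0;
-- Step 5: count the members of each group
namecounts = FOREACH namegroups GENERATE group, COUNT(colnames);
DUMP namecounts;
```

Each alias is one stage of the plan, so you can `DUMP` or `DESCRIBE` an intermediate alias to inspect it.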

How does it work with Cassandra?

Leverages Cassandra's Hadoop MapReduce classes:
- ColumnFamilyInputFormat
- ColumnFamilyOutputFormat
- CassandraStorage, a Pig LoadFunc/StoreFunc built on top of them:
  STORE stuff INTO 'cassandra://ks/cf' USING CassandraStorage();
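A minimal round-trip sketch, assuming the tuples that CassandraStorage loads (a key plus a bag of columns) are in the same shape it accepts on store. `cf_copy` is a hypothetical column family that must already exist in keyspace `ks`.

```pig
-- Read every row of ks/cf through ColumnFamilyInputFormat
rows = LOAD 'cassandra://ks/cf' USING CassandraStorage();
-- Write the rows unchanged into another column family
-- (cf_copy is a made-up name; create it before running the script)
STORE rows INTO 'cassandra://ks/cf_copy' USING CassandraStorage();
```

Anything between the `LOAD` and the `STORE` is ordinary Pig, as long as the final relation keeps the shape that the store function expects.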

Configuration can be tricky

In mapred-site.xml:

 <property>  
 	<name>cassandra.thrift.port</name>  
 	<value>9160</value>  
 </property>  
 <property>  
 	<name>cassandra.thrift.address</name>  
 	<value>hostname</value>  
 </property>  
 <property>  
 	<name>cassandra.partitioner.class</name>  
 	<value>org.apache.cassandra.dht.RandomPartitioner</value>  
 </property>

Suggestion: give Brisk a spin to sidestep the manual configuration.

Real-world use
- Only useful if you have to iterate over ALL your Cassandra data AND
- You don't have a copy of the data in Hadoop
- Why?
  - If the data is also in Hadoop, HDFS offers much faster reads because it stores data in much larger sequential blocks.
  - Right now, you can't narrow your Pig job with an indexed search, so every job scans the whole column family.
  - When that's fixed, Pig will really be useful. (Hive too)

Resources
- SVN
- $BRISK_HOME/resources/pig/examples

© 2011 Matt Kennedy, All Rights Reserved.