Outline for CassandraDC Meetup Talk, 07/06/2011
-
What's Pig?
-
-- Counts all the rows in a column family
rows = LOAD 'cassandra://ks/cf' USING CassandraStorage();
countthis = GROUP rows ALL;
countedrows = FOREACH countthis GENERATE COUNT(rows.$0);
DUMP countedrows;
-
Hive:SQL :: Pig:SQL execution plan
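To make the analogy concrete: where Hive has you write the SQL, Pig has you write the dataflow steps a query planner would produce. A rough sketch of the same aggregate both ways (the 'users.tsv' file, field names, and schema below are hypothetical, not from the talk):

-- SQL, roughly what you'd write in Hive:
--   SELECT state, COUNT(*) FROM users GROUP BY state;

-- The same computation in Pig Latin, as explicit dataflow steps:
users      = LOAD 'users.tsv' USING PigStorage('\t')
             AS (name:chararray, state:chararray);
bystate    = GROUP users BY state;
statecount = FOREACH bystate GENERATE group AS state, COUNT(users) AS n;
DUMP statecount;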
-
-
How does it work with Cassandra?
-
Leverages the Cassandra/Hadoop MapReduce classes (obviously)
- ColumnFamilyInputFormat
- ColumnFamilyOutputFormat
- CassandraStorage LoadFunc/StoreFunc, e.g.:
  STORE stuff INTO 'cassandra://ks/cf' USING CassandraStorage();
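A minimal read/transform/write sketch with CassandraStorage (keyspace and column family names are hypothetical; the schema assumes the key-plus-columns-bag layout CassandraStorage returns):

-- Load rows, keep the ones with at least ten columns, and write them
-- to a second column family (names here are placeholders).
rows = LOAD 'cassandra://ks/source_cf' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
wide = FILTER rows BY SIZE(columns) >= 10L;
STORE wide INTO 'cassandra://ks/dest_cf' USING CassandraStorage();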
-
Configuration can be tricky
- In mapred-site.xml:
  <property>
    <name>cassandra.thrift.port</name>
    <value>9160</value>
  </property>
  <property>
    <name>cassandra.thrift.address</name>
    <value>hostname</value>
  </property>
  <property>
    <name>cassandra.partitioner.class</name>
    <value>org.apache.cassandra.dht.RandomPartitioner</value>
  </property>
- Suggest giving Brisk a spin.
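If editing mapred-site.xml isn't convenient, the same keys can often be pushed through from the script itself; a sketch, assuming Pig's set command forwards these properties into the Hadoop job configuration (the hostname is a placeholder):

-- Same three properties, set from inside the Pig script (placeholder values).
set cassandra.thrift.address 'cassandra-host';
set cassandra.thrift.port '9160';
set cassandra.partitioner.class 'org.apache.cassandra.dht.RandomPartitioner';
rows = LOAD 'cassandra://ks/cf' USING CassandraStorage();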
-
-
Real-World use
- Only useful if you have to iterate over ALL your Cassandra data AND
- You don't have a copy of the data in Hadoop
- Why?
- If the data is also in Hadoop, HDFS offers much faster reads thanks to its much larger sequential blocks (see the sketch after this list).
- Right now, you can't narrow your Pig job with an indexed search
- When that's fixed, Pig will really be useful. (Hive too)
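To make the HDFS point concrete: the only line that changes is the LOAD. A sketch with hypothetical paths and names:

-- Reading straight from Cassandra: every row comes back over Thrift.
rows_cass = LOAD 'cassandra://ks/cf' USING CassandraStorage();
-- Reading a copy already sitting in HDFS: big sequential block reads.
rows_hdfs = LOAD '/data/cf_export' USING PigStorage('\t');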
-
Resources
- SVN
- $BRISK_HOME/resources/pig/examples
© 2011 Matt Kennedy, All Rights Reserved.