Asynchronous Scala DSL for Cassandra
====================================

The current version is: `val phantomVersion = "0.8.0"`.
Phantom is published to Maven Central and it's actively and avidly developed.
- Issues and questions
- Commercial support
- Integrating phantom in your project
- phantom columns
- Data modeling with phantom
- Basic query examples
- Automated schema generation
- Cassandra indexing
- Asynchronous iterators
- Batch statements
- Thrift integration
- Running the tests locally
- Using GitFlow as a branching model
- Scala style guidelines for contributions
- Using the testing utilities to write tests
- Copyright
We love Cassandra to bits and use it in every bit of our stack. phantom makes it super trivial for Scala users to embrace Cassandra.
Cassandra is highly scalable and it's by far the most powerful database technology available, open source or otherwise.
Phantom is built on top of the Datastax Java Driver, which does most of the heavy lifting.
If you're completely new to Cassandra, a much better place to start is the Datastax Introduction to Cassandra.
We are very happy to help implement missing features in phantom, answer questions about phantom, and occasionally help you out with Cassandra questions, although do note we're a bit short staffed!
You can get in touch via the newzly-phantom Google Group.
We, the people behind phantom, run a software development house specialised in Scala, NoSQL and distributed systems. If you are after enterprise grade training or support for using phantom, Websudos is here to help!
We offer a comprehensive range of services, including but not limited to:
- Training
- In-house software development
- Team building
- Architecture planning
- Startup product development
We are big fans of open source and we will open source every project we can! To read more about our OSS efforts, click here.
For most things, all you need is `phantom-dsl`. Read through for information on other modules.
```scala
libraryDependencies ++= Seq(
  "com.newzly" %% "phantom-dsl" % phantomVersion
)
```
The full list of available modules is:
```scala
libraryDependencies ++= Seq(
  "com.newzly" %% "phantom-dsl" % phantomVersion,
  "com.newzly" %% "phantom-cassandra-unit" % phantomVersion,
  "com.newzly" %% "phantom-example" % phantomVersion,
  "com.newzly" %% "phantom-thrift" % phantomVersion,
  "com.newzly" %% "phantom-test" % phantomVersion
)
```
This is the list of available columns and how they map to C* data types. It also includes the newly introduced static columns in C* 2.0.6.
The type of a static column can be any of the allowed primitive Cassandra types. phantom won't let you mix in a non-primitive via implicit magic.
phantom columns | Java/Scala type | Cassandra type |
---|---|---|
BigDecimalColumn | scala.math.BigDecimal | decimal |
BigIntColumn | scala.math.BigInt | varint |
BooleanColumn | scala.Boolean | boolean |
DateColumn | java.util.Date | timestamp |
DateTimeColumn | org.joda.time.DateTime | timestamp |
DoubleColumn | scala.Double | double |
FloatColumn | scala.Float | float |
IntColumn | scala.Int | int |
InetAddressColumn | java.net.InetAddress | inet |
LongColumn | scala.Long | bigint |
StringColumn | java.lang.String | text |
UUIDColumn | java.util.UUID | uuid |
TimeUUIDColumn | java.util.UUID | timeuuid |
CounterColumn | scala.Long | counter |
StaticColumn<type> | <type> | type static |
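A static column is declared by mixing the `StaticColumn` trait over a regular primitive column. A minimal sketch (the `country` column name is hypothetical):

```scala
// Hypothetical column inside a table definition: the value of `country`
// is shared by every row in the same partition (text static in CQL).
object country extends StringColumn(this) with StaticColumn[String]
```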
Optional columns allow you to set a column to `null` or `None`. Use them when you really want something to be optional. The outcome is that instead of a `T` you get an `Option[T]`, and you can `match`, `fold`, `flatMap` and `map` on a `None`.
The `Optional` part is handled at the DSL level; it's not translated to Cassandra in any way.
phantom columns | Java/Scala type | Cassandra columns |
---|---|---|
OptionalBigDecimalColumn | Option[scala.math.BigDecimal] | decimal |
OptionalBigIntColumn | Option[scala.math.BigInt] | varint |
OptionalBooleanColumn | Option[scala.Boolean] | boolean |
OptionalDateColumn | Option[java.util.Date] | timestamp |
OptionalDateTimeColumn | Option[org.joda.time.DateTime] | timestamp |
OptionalDoubleColumn | Option[scala.Double] | double |
OptionalFloatColumn | Option[scala.Float] | float |
OptionalIntColumn | Option[scala.Int] | int |
OptionalInetAddressColumn | Option[java.net.InetAddress] | inet |
OptionalLongColumn | Option[scala.Long] | bigint |
OptionalStringColumn | Option[java.lang.String] | text |
OptionalUUIDColumn | Option[java.util.UUID] | uuid |
OptionalTimeUUIDColumn | Option[java.util.UUID] | timeuuid |
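For instance, reading an optional column inside `fromRow` (a sketch against the `test` column of the example table further down):

```scala
// test(row) extracts an Option[Int] instead of an Int, so a null cell
// becomes None and the usual Option combinators apply.
override def fromRow(row: Row): ExampleModel = {
  val maybeTest: Option[Int] = test(row)        // None when the cell is null
  val withDefault: Int = maybeTest.getOrElse(0) // fold/map/flatMap work too
  ExampleModel(id(row), name(row), props(row), timestamp(row), maybeTest)
}
```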
Cassandra collections do not allow custom data types. Storing JSON as a string is possible, but it's still a `text` column as far as Cassandra is concerned. The `type` in the example below is always a default C* type.
phantom columns | Cassandra columns |
---|---|
ListColumn.<type> | list<type> |
SetColumn.<type> | set<type> |
MapColumn.<type, type> | map<type, type> |
phantom uses a specific set of traits to enforce more advanced Cassandra limitations and schema rules at compile time.
This is the default partitioning key of the table, telling Cassandra how to divide data into partitions and store them accordingly. You must define at least one partition key for a table. Phantom will gently remind you of this with a fatal error.
If you use a single partition key, the `PartitionKey` will always be the first `PrimaryKey` in the schema.
It looks like this in CQL: `PRIMARY KEY (your_partition_key, primary_key_1, primary_key_2)`.
Using more than one `PartitionKey[T]` in your schema definition will output a composite key in Cassandra: `PRIMARY KEY ((your_partition_key_1, your_partition_key_2), primary_key_1, primary_key_2)`.
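For instance, a hedged sketch of a table with a composite partition key (column names hypothetical):

```scala
// Both PartitionKey columns form the composite partition key:
// PRIMARY KEY ((country, city), timestamp)
object country extends StringColumn(this) with PartitionKey[String]
object city extends StringColumn(this) with PartitionKey[String]
object timestamp extends DateTimeColumn(this) with PrimaryKey[DateTime]
```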
As its name says, using this will mark a column as a `PrimaryKey`. Using multiple values will result in a compound key.
The first `PrimaryKey` is used to partition data. phantom will force you to always define a `PartitionKey` so you don't forget about how your data is partitioned. We also use this DSL restriction because we hope to do more clever things with it in the future.
A compound key in C* looks like this: `PRIMARY KEY (primary_key, primary_key_1, primary_key_2)`.
Before you add too many of these, remember they all have to go into a `where` clause. You can only query with a full primary key, even if it's compound. phantom can't yet give you a compile-time error for this, but Cassandra will give you a runtime one.
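For example, assuming a table with the compound key `(id, timestamp, name)` like the compound key example further down, a valid query binds every component (a sketch; the bound values are placeholders):

```scala
// Every part of the compound key shows up in the where/and chain;
// omitting one makes Cassandra reject the query at runtime.
ExampleRecord.select
  .where(_.id eqs someId)
  .and(_.timestamp eqs someDate)
  .and(_.name eqs someName)
  .one()
```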
This is a SecondaryIndex in Cassandra. It can make columns queryable, but it's not exactly high performance. It's generally best to avoid it; we implemented it to show off what good guys we are.
When you mix in `Index[T]` on a column, phantom will let you use it in a `where` clause. However, don't forget to use `allowFiltering` for such queries, otherwise C* will give you an error.
This can be used with either `java.util.Date` or `org.joda.time.DateTime`. It tells Cassandra to store records in a certain order based on this field.
An example might be: `object timestamp extends DateTimeColumn(this) with ClusteringOrder[DateTime] with Ascending`.
To fully define a clustering column, you MUST also mix in either `Ascending` or `Descending` to indicate the sorting order.
These columns are especially useful if you are building Thrift services. They are deeply integrated with Twitter Scrooge and relevant to the Twitter ecosystem (Finagle, Zipkin, Storm etc.).
They are available via the `phantom-thrift` module, and you need to import `com.newzly.phantom.thrift.Implicits._` to get them.
In the scenario below, the C* type is always `text` and the type you need to pass to the column is a Thrift struct, specifically a `com.twitter.scrooge.ThriftStruct`. phantom will use a `CompactThriftSerializer`, store the record as a binary string, and then re-parse it on fetch.
Thrift serialization and deserialization is extremely fast, so you don't need to worry about speed or performance overhead. You generally use these to store collections (a small number of items), not big things.
phantom columns | Cassandra columns |
---|---|
ThriftColumn.<type> | text |
ThriftListColumn.<type> | list<text> |
ThriftSetColumn.<type> | set<text> |
ThriftMapColumn.<type, type> | map<text, text> |
```scala
import java.util.UUID
import org.joda.time.DateTime
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._

case class ExampleModel(
  id: UUID,
  name: String,
  props: Map[String, String],
  timestamp: DateTime,
  test: Option[Int]
)

sealed class ExampleRecord extends CassandraTable[ExampleRecord, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object timestamp extends DateTimeColumn(this) with ClusteringOrder[DateTime] with Ascending
  object name extends StringColumn(this)
  object props extends MapColumn[ExampleRecord, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}
```
The query syntax is inspired by the Foursquare Rogue library and aims to replicate CQL 3 as much as possible.
Phantom works with both Scala Futures and Twitter Futures as first class citizens.
Method name | Description |
---|---|
where | The WHERE clause in CQL |
and | Chains several clauses, creating a WHERE ... AND query |
orderBy | Adds an ORDER BY column_name to the query |
allowFiltering | Allows Cassandra to filter records in memory. This is an expensive operation. |
useConsistencyLevel | Sets the consistency level to use. |
setFetchSize | Sets the maximum number of records to retrieve. The default is 10000. |
limit | Sets the exact number of records to retrieve. |
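A sketch chaining a few of these together (the bound values are placeholders):

```scala
// Filter on the partition key, tune the page size, and cap the result
// at 100 records.
ExampleRecord.select
  .where(_.id eqs someId)
  .setFetchSize(500)
  .limit(100)
  .fetch
```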
Select queries are very straightforward and enforce most limitations at compile time.
Operator name | Description |
---|---|
eqs | The "equals" operator. Will match if the objects are equal |
in | The "in" operator. Will match if the object is found in the list of arguments |
gt | The "greater than" operator. Will match if the record is greater than the argument and exists |
gte | The "greater than or equals" operator. Will match if the record is greater than or equal to the argument and exists |
lt | The "lower than" operator. Will match if the record is less than the argument and exists |
lte | The "lower than or equals" operator. Will match if the record is less than or equal to the argument and exists |
All partial select queries return tuples and are therefore limited to 22 fields. We haven't yet bothered to add more than 10 fields to the select, but you can always do a Pull Request. The file you are looking for is here. The 22 field limitation will change in Scala 2.11, and phantom will be updated once cross version compilation is enabled.
```scala
def getNameById(id: UUID): Future[Option[String]] = {
  ExampleRecord.select(_.name).where(_.id eqs id).one()
}

def getNameAndPropsById(id: UUID): Future[Option[(String, Map[String, String])]] = {
  ExampleRecord.select(_.name, _.props).where(_.id eqs id).one()
}
```
Method name | Description |
---|---|
value | A type safe insert query builder. Throws an error for null values. |
valueOrNull | This will accept a null without throwing an error. |
useConsistencyLevel | Sets the consistency level to use. |
ttl | Sets the "Time-To-Live" for the record. |
Method name | Description |
---|---|
where | The WHERE clause in CQL |
and | Chains several clauses, creating a WHERE ... AND query |
modify | The actual update query builder |
useConsistencyLevel | Sets the consistency level to use. |
onlyIf | Additional update condition. Used on non-primary columns |
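A sketch of a conditional update; the exact `onlyIf` predicate syntax is an assumption here, not verbatim API:

```scala
// Update guarded by a condition on a non-primary column (a CQL
// lightweight transaction). The predicate form is assumed.
ExampleRecord.update
  .where(_.id eqs someId)
  .modify(_.name setTo "newName")
  .onlyIf(_.name eqs "oldName")
  .future()
```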
Method name | Description |
---|---|
where | The WHERE clause in CQL |
useConsistencyLevel | Sets the consistency level to use. |
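And a sketch of a delete keyed on the partition key (the `delete` builder name is assumed from the section above):

```scala
// Removes the entire row for the given partition key.
ExampleRecord.delete
  .where(_.id eqs someId)
  .future()
```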
The full list can be found in CQLQuery.scala.
Method name | Description |
---|---|
tracing_= | The Cassandra utility method. Enables or disables tracing. |
queryString | Gets the output CQL 3 query of a phantom query. |
consistencyLevel | Retrieves the consistency level in use. |
consistencyLevel_= | Sets the consistency level to use. |
retryPolicy | Retrieves the RetryPolicy in use. |
retryPolicy_= | Sets the RetryPolicy to use. |
serialConsistencyLevel | Retrieves the serial consistency level in use. |
serialConsistencyLevel_= | Sets the serial consistency level to use. |
forceNoValues_= | Forces the query to serialize with values inlined in the query string instead of bound separately. |
routingKey | Retrieves the routing key as a ByteBuffer. |
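For example, inspecting a query before running it (a sketch):

```scala
// Build the query, print the CQL 3 it will produce, and enable tracing.
val query = ExampleRecord.select.where(_.id eqs someId)
println(query.queryString) // the generated CQL 3 statement
query.tracing = true       // uses the tracing_= setter
```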
```scala
ExampleRecord.select.one()           // when you only want to select one record
ExampleRecord.update.where(_.name eqs name).modify(_.name setTo "someOtherName").future() // when you don't care about the return type
ExampleRecord.select.fetchEnumerator // when you need an Enumerator
ExampleRecord.select.fetch           // when you want to fetch a Seq[Record]
```
```scala
import java.util.UUID
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import com.newzly.phantom.Implicits._

object ExampleRecord extends ExampleRecord {
  override val tableName = "examplerecord"

  // now define a session, a normal Datastax cluster connection
  implicit val session = SomeCassandraClient.session

  def getRecordsByName(name: String): Future[Seq[ExampleModel]] = {
    ExampleRecord.select.where(_.name eqs name).fetch
  }

  def getOneRecordByName(name: String, someId: UUID): Future[Option[ExampleModel]] = {
    ExampleRecord.select.where(_.name eqs name).and(_.id eqs someId).one()
  }
}
```
```scala
ExampleRecord.select.get()     // when you only want to select one record
ExampleRecord.update.where(_.name eqs name).modify(_.name setTo "someOtherName").execute() // when you don't care about the return type
ExampleRecord.select.enumerate // when you need an Enumerator
ExampleRecord.select.collect   // when you want to fetch a Seq[Record]
```
```scala
import java.util.UUID
import com.twitter.util.Future
import com.newzly.phantom.Implicits._

object ExampleRecord extends ExampleRecord {
  override val tableName = "examplerecord"

  // now define a session, a normal Datastax cluster connection
  implicit val session = SomeCassandraClient.session

  def getRecordsByName(name: String): Future[Seq[ExampleModel]] = {
    ExampleRecord.select.where(_.name eqs name).collect
  }

  def getOneRecordByName(name: String, someId: UUID): Future[Option[ExampleModel]] = {
    ExampleRecord.select.where(_.name eqs name).and(_.id eqs someId).get()
  }
}
```
Based on the above list of columns, phantom supports CQL 3 modify operations for the CQL 3 collections: `list`, `set` and `map`.
All operators are available in an update query, of the form: `ExampleRecord.update.where(_.id eqs someId).modify(_.someList $OPERATOR $args).future()`.
Examples in ListOperatorsTest.scala.
Name | Description |
---|---|
prepend | Adds an item to the head of the list |
prependAll | Adds multiple items to the head of the list |
append | Adds an item to the tail of the list |
appendAll | Adds multiple items to the tail of the list |
discard | Removes the given item from the list. |
discardAll | Removes all given items from the list. |
setIdx | Updates a specific index in the list |
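A sketch using two of the operators above; `someList` is a hypothetical `ListColumn`:

```scala
// Prepend to the head and append to the tail of a list column.
ExampleRecord.update.where(_.id eqs someId)
  .modify(_.someList prepend "head item").future()

ExampleRecord.update.where(_.id eqs someId)
  .modify(_.someList append "tail item").future()
```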
Sets have better performance than lists, as the Cassandra documentation suggests. Examples in SetOperationsTest.scala.
Name | Description |
---|---|
add | Adds an item to the set |
addAll | Adds multiple items to the set |
remove | Removes the given item from the set. |
removeAll | Removes all given items from the set. |
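A sketch; `someSet` is a hypothetical `SetColumn`:

```scala
// Add one item to, then remove one item from, a set column.
ExampleRecord.update.where(_.id eqs someId)
  .modify(_.someSet add "item").future()

ExampleRecord.update.where(_.id eqs someId)
  .modify(_.someSet remove "item").future()
```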
Both the key and value types of a Map must be Cassandra primitives. Examples in MapOperationsTest.scala:
Name | Description |
---|---|
put | Adds a (key -> value) pair to the map |
putAll | Adds multiple (key -> value) pairs to the map |
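A sketch; `someMap` is a hypothetical `MapColumn` with string keys and values:

```scala
// Put a single pair, then merge a whole map, into a map column.
ExampleRecord.update.where(_.id eqs someId)
  .modify(_.someMap put ("key" -> "value")).future()

ExampleRecord.update.where(_.id eqs someId)
  .modify(_.someMap putAll Map("k1" -> "v1", "k2" -> "v2")).future()
```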
Replication strategies and more advanced features are not yet available in phantom, but CQL 3 Table schemas are automatically generated from the Scala code. To create a schema in Cassandra from a table definition:
```scala
import scala.concurrent.Await
import scala.concurrent.duration._

Await.result(ExampleRecord.create().future(), 5000 millis)
```
Of course, you don't have to block unless you want to.
```scala
import java.util.UUID
import scala.concurrent.Await
import scala.concurrent.duration._
import org.joda.time.DateTime
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._

sealed class ExampleRecord2 extends CassandraTable[ExampleRecord2, ExampleModel] with LongOrderKey[ExampleRecord2, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object timestamp extends DateTimeColumn(this)
  object name extends StringColumn(this)
  object props extends MapColumn[ExampleRecord2, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}
```
```scala
val orderedResult = Await.result(Articles.select.where(_.id gtToken one.get.id).fetch, 5000 millis)
```
Operator name | Description |
---|---|
eqsToken | The "equals" operator. Will match if the objects are equal |
gtToken | The "greater than" operator. Will match if the record is greater than the argument |
gteToken | The "greater than or equals" operator. Will match if the record is greater than or equal to the argument |
ltToken | The "lower than" operator. Will match if the record is less than the argument |
lteToken | The "lower than or equals" operator. Will match if the record is less than or equal to the argument |
For more details on how to use Cassandra partition tokens, see SkipRecordsByToken.scala.
phantom supports Cassandra time series with both `java.util.Date` and `org.joda.time.DateTime`. To use them, simply mix in `com.newzly.phantom.keys.ClusteringOrder` and either `Ascending` or `Descending`.
Restrictions are enforced at compile time.
```scala
import java.util.UUID
import org.joda.time.DateTime
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._

sealed class ExampleRecord3 extends CassandraTable[ExampleRecord3, ExampleModel] with LongOrderKey[ExampleRecord3, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object timestamp extends DateTimeColumn(this) with ClusteringOrder[DateTime] with Ascending
  object name extends StringColumn(this)
  object props extends MapColumn[ExampleRecord3, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}
```
Automatic schema generation can do all the setup for you.
Phantom also supports using compound keys out of the box. The schema can once again be auto-generated.
A table can have only one `PartitionKey` but several `PrimaryKey` definitions. Phantom will use these keys to build a compound value. Example scenario, with the compound key `(id, timestamp, name)`:
```scala
import java.util.UUID
import org.joda.time.DateTime
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._

sealed class ExampleRecord3 extends CassandraTable[ExampleRecord3, ExampleModel] with LongOrderKey[ExampleRecord3, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object timestamp extends DateTimeColumn(this) with PrimaryKey[DateTime]
  object name extends StringColumn(this) with PrimaryKey[String]
  object props extends MapColumn[ExampleRecord3, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}
```
When you want to use a column in a `where` clause, you need an index on it. Cassandra data modeling is out of the scope of this writing, but phantom offers `com.newzly.phantom.keys.Index` to enable querying.
The CQL 3 schema for secondary indexes can also be auto-generated with `ExampleRecord4.create()`.
`SELECT` is the only query you can perform with an `Index` column. This is a Cassandra limitation. The relevant tests are found here.
```scala
import java.util.UUID
import org.joda.time.DateTime
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._

sealed class ExampleRecord4 extends CassandraTable[ExampleRecord4, ExampleModel] with LongOrderKey[ExampleRecord4, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object timestamp extends DateTimeColumn(this) with Index[DateTime]
  object name extends StringColumn(this) with Index[String]
  object props extends MapColumn[ExampleRecord4, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}
```
Phantom comes packed with asynchronous, lazy iterators over CQL rows to help you deal with billions of records. phantom iterators are based on Play iteratees, with very lightweight integration. The functionality is identical with respect to asynchronous, lazy behaviour and available methods. For more on this, see this Play tutorial.
Usage is trivial. If you want to use `slice`, `take` or `drop` with iterators, the partitioner needs to be ordered.
```scala
import java.util.UUID
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import org.joda.time.DateTime
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._

sealed class ExampleRecord3 extends CassandraTable[ExampleRecord3, ExampleModel] with LongOrderKey[ExampleRecord3, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object timestamp extends DateTimeColumn(this) with PrimaryKey[DateTime]
  object name extends StringColumn(this) with PrimaryKey[String]
  object props extends MapColumn[ExampleRecord3, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}

object ExampleRecord3 extends ExampleRecord3 {
  def getRecords(start: Int, limit: Int): Future[Set[ExampleModel]] = {
    select.fetchEnumerator.slice(start, limit).collect
  }
}
```
phantom also brings in support for batch statements. To use them, see IterateeBigTest.scala. We have tested with 10,000 statements per batch and 1,000 batches processed simultaneously. Before you run the test, beware that it takes ~40 minutes.
Batches use lazy iterators and daisy-chain them to offer thread safe behaviour. They are not memory intensive and you can expect consistent processing speed even with 1,000,000 statements per batch.
Batches are immutable, and adding a new record will result in a new batch, just like most things Scala, so be careful to chain the calls.
phantom also supports COUNTER batch updates and UNLOGGED batch updates.
```scala
import com.newzly.phantom.Implicits._

BatchStatement()
  .add(ExampleRecord.update.where(_.id eqs someId).modify(_.name setTo "blabla"))
  .add(ExampleRecord.update.where(_.id eqs someOtherId).modify(_.name setTo "blabla2"))
  .future()
```
```scala
import com.newzly.phantom.Implicits._

CounterBatchStatement()
  .add(ExampleRecord.update.where(_.id eqs someId).modify(_.someCounter increment 500L))
  .add(ExampleRecord.update.where(_.id eqs someOtherId).modify(_.someCounter decrement 300L))
  .future()
```
```scala
import com.newzly.phantom.Implicits._

UnloggedBatchStatement()
  .add(ExampleRecord.update.where(_.id eqs someId).modify(_.name setTo "blabla"))
  .add(ExampleRecord.update.where(_.id eqs someOtherId).modify(_.name setTo "blabla2"))
  .future()
```
We use Apache Thrift extensively for our backend services. phantom is very easy to integrate with Thrift models and uses Twitter Scrooge to compile them. Thrift integration is optional and available via `"com.newzly" %% "phantom-thrift" % phantomVersion`.
```thrift
namespace java com.newzly.phantom.sample.ExampleModel

struct ExampleModel {
  1: required i32 id,
  2: required string name,
  3: required map<string, string> props,
  4: required i32 timestamp,
  5: optional i32 test
}
```
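On the Scala side, the Scrooge-generated struct then backs a Thrift column. A hedged sketch; the exact `ThriftColumn` type parameters are an assumption and may differ between phantom versions:

```scala
import java.util.UUID
import com.datastax.driver.core.Row
import com.newzly.phantom.Implicits._
import com.newzly.phantom.thrift.Implicits._

// Sketch: the Scrooge-generated ExampleModel struct is stored in a single
// text column and (de)serialized with CompactThriftSerializer.
sealed class ThriftTable extends CassandraTable[ThriftTable, ExampleModel] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object model extends ThriftColumn[ThriftTable, ExampleModel, ExampleModel](this)

  override def fromRow(row: Row): ExampleModel = model(row)
}
```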
phantom uses Embedded Cassandra to run tests without a local Cassandra server running. You need two terminals to run the tests: one for Embedded Cassandra and one for the actual tests.

```
sbt
project phantom-cassandra-unit
run
```

Then in a new terminal:

```
sbt
project phantom-test
test
```
Phantom was developed at newzly as an in-house project. All Cassandra integration at newzly goes through phantom.
- Flavian Alexandru flavian@newzly.com (maintainer)
- Tomasz Perek tomasz.perek@newzly.com
- Viktor Taranenko (viktortnk)
- Bartosz Jankiewicz (@bjankie1)
- Eugene Zhulenev (@ezhulenev)
Special thanks to Viktor Taranenko from WhiskLabs, who gave us the original idea.
Copyright 2013 WhiskLabs, Copyright 2013 - 2014 newzly.
Contributions are most welcome!
To contribute, simply submit a "Pull request" via GitHub.
We use GitFlow as a branching model and SemVer for versioning.
- When you submit a "Pull request" we require all changes to be squashed.
- We never merge more than one commit at a time. All the n commits on your feature branch must be squashed.
- We won't look at the pull request until Travis CI says the tests pass, so make sure the tests go well.
In spirit, we follow the Twitter Scala Style Guidelines. We will reject your pull request if it doesn't meet code standards, but we'll happily give you a hand to get it right.
Some of the things that will make us seriously frown:

- Blocking when you don't have to. It just makes our eyes hurt when we see useless blocking.
- Testing should be thread safe and fully async. Use `ParallelTestExecution` if you want to show off.
- Use the common patterns you already see here; we've done a lot of work to make it easy.
- Don't randomly import stuff. We are very big on alphabetized clean imports.