Getting familiar with accessing Cassandra from Python

mramshaw/Python_Cassandra

Cassandra with Python

Cassandra

Cassandra is a NoSQL database that originated at Facebook.

Cassandra is optimized for fast writes and fast reads over very large volumes of data.

In contrast with traditional databases that journal database changes and then write them to disk, Cassandra journals database changes and then writes them to a write-back cache (also known as a write-behind cache) - and only writes the cache to disk once the cache fills.

    Journal --> Cache --> Disk 

The Cassandra terms for these are the commit log, Memtables and SSTables [which stands for Sorted String Tables; these are sorted in row order and are immutable]. A write is acknowledged as successful once it has been appended to the commit log and written to the Memtable; it does not wait for an SSTable to be written. How the data is then propagated to other nodes depends on the replication policy (we will use simple replication).

As SSTables are immutable, deletes are handled via a logical delete indicator, which is referred to as a Tombstone in Cassandra. Compaction merges SSTables and is used to remove logically deleted records [the uncompacted original SSTable continues to exist until the JVM runs GC (garbage collection)].
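The write path and tombstone handling described above can be illustrated with a toy model. This is only a sketch: the ToyStore class and its method names are invented for illustration and bear no relation to Cassandra's actual implementation. In particular, real Cassandra keeps tombstones for gc_grace_seconds before compaction may drop them, which this model ignores.

```python
class ToyStore:
    """Toy model of Journal --> Cache --> Disk (hypothetical, for illustration)."""

    TOMBSTONE = object()  # logical delete marker

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only journal
        self.memtable = {}     # in-memory write-back cache
        self.sstables = []     # immutable sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. journal the change
        self.memtable[key] = value            # 2. update the cache
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. write to "disk" only when full

    def delete(self, key):
        # Nothing is removed in place: a delete is just another write.
        self.write(key, self.TOMBSTONE)

    def flush(self):
        if self.memtable:
            snapshot = sorted(self.memtable.items(), key=lambda kv: kv[0])
            self.sstables.append(tuple(snapshot))  # a tuple is never modified
            self.memtable = {}

    def read(self, key):
        value = self.memtable.get(key)
        if key not in self.memtable:
            for table in reversed(self.sstables):  # newest snapshot wins
                found = dict(table)
                if key in found:
                    value = found[key]
                    break
        return None if value is self.TOMBSTONE else value

    def compact(self):
        merged = {}
        for table in self.sstables:  # oldest first, so newer entries overwrite
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not self.TOMBSTONE}
        snapshot = sorted(live.items(), key=lambda kv: kv[0])
        self.sstables = [tuple(snapshot)] if snapshot else []
```

Note that delete() never touches an existing snapshot; the deleted record only disappears when compact() builds a new snapshot without it.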

By design, there is no single point of failure.

In terms of the CAP or Brewer's theorem, Cassandra is an eventually-consistent database. This means that replicas of a row may have different versions of the data - but only for brief periods. The replicas will eventually be synchronized and become consistent (hence the term).

[Figure: CAP and Cassandra]

[This is a slight over-simplification, as Cassandra can be extensively tuned for performance/consistency.]
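As a rough illustration of replicas converging, the sketch below models each replica as a dictionary of key to (timestamp, value) and reconciles them by keeping the newest timestamp (Cassandra resolves conflicting versions by write timestamp, last write wins). The reconcile function is hypothetical; it does not model how Cassandra actually synchronizes replicas, only the end result.

```python
def reconcile(replicas):
    """Bring every replica up to the newest (timestamp, value) seen per key.

    Each replica is a dict mapping key -> (timestamp, value).
    This is a toy model for illustration, not Cassandra's repair mechanism.
    """
    newest = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    for replica in replicas:
        replica.update(newest)  # every replica now holds the same versions
    return newest


# Three replicas that briefly disagree about one row.
replica_1 = {"jesse": (1, "secret")}
replica_2 = {"jesse": (2, "s3cret")}
replica_3 = {}
reconcile([replica_1, replica_2, replica_3])
```

After reconcile() runs, all three replicas hold the version written at timestamp 2.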

Motivation

Familiarization with Cassandra and CQL from Python, using the DataStax driver.

This exercise follows on from my Replicated Cassandra Database exercise.

Prerequisites

  • Python installed

  • pip installed

Cassandra driver

The installation of the Cassandra driver (for Python) is slightly involved.

There are also optional components (including non-Python components).

Installation

Install the Cassandra driver as follows:

$ pip install --user cassandra-driver

Or else:

$ pip install --user -r requirements.txt

[This will also install some optional components, as discussed below.]

Verification

Verify installation as follows:

$ python -c 'import cassandra; print(cassandra.__version__)'
3.16.0
$

Or:

$ pip list --format=freeze | grep cassandra-driver
cassandra-driver==3.16.0
$

Compression

Optionally, install lz4 (gets installed with cassandra-driver if using requirements.txt):

$ pip install --user lz4

Verify installation as follows:

$ python -c 'import lz4; print(lz4.__version__)'
2.1.2
$

Or:

$ pip list --format=freeze | grep lz4
lz4==2.1.2
$

Metrics

Optionally, install scales (gets installed with cassandra-driver if using requirements.txt):

$ pip install --user scales

The driver has built-in support for capturing metrics about the queries it runs, exposed as Cluster.metrics. The scales library is required to support this.

Performance

Optionally, install libev for better performance.

Verify the presence (or - as below - absence) of libev as follows:

$ python -c 'from cassandra.io.libevreactor import LibevConnection'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/owner/.local/lib/python2.7/site-packages/cassandra/io/libevreactor.py", line 33, in <module>
    "The C extension needed to use libev was not found.  This "
ImportError: The C extension needed to use libev was not found.  This probably means that you didn't have the required build dependencies when installing the driver.  See http://datastax.github.io/python-driver/installation.html#c-extensions for instructions on installing build dependencies and building the C extension.
$

Installation instructions are here:

http://datastax.github.io/python-driver/installation.html#libev-support

[We will not be installing libev.]
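The individual checks above can be combined into one small helper. This is a sketch only: the check_components function and its labels are made up for illustration, and it assumes that scales is importable as greplin.scales. It reports None for anything that fails to import, which is what happens for libev in the traceback shown above.

```python
import importlib


def check_components():
    """Report which driver components can be imported (hypothetical helper)."""
    modules = {
        "cassandra-driver": "cassandra",
        "lz4": "lz4",
        "scales": "greplin.scales",            # assumed import name for scales
        "libev": "cassandra.io.libevreactor",  # raises ImportError without the C extension
    }
    status = {}
    for label, module_name in modules.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            status[label] = None  # not installed (or not built)
        else:
            # Not every module exposes __version__, so fall back to a marker.
            status[label] = str(getattr(module, "__version__", "installed"))
    return status


for label, version in sorted(check_components().items()):
    print("%-18s %s" % (label, version or "missing"))
```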

Running Cassandra

We will test everything first with Docker and cqlsh and then we will use Python code to access our running Cassandra.

To make things explicit, pull a specific tagged Cassandra image, as follows:

$ docker pull cassandra:3.11.3

[The current version is 3.11.3 as of this writing, but may change over time.]

Run Cassandra with Docker

[We will use Docker linking to expose Cassandra.]

Run Cassandra as follows:

$ docker run --name python-cassandra cassandra:3.11.3

[We could run this detached with the -d option, but then we would have to tail the log with docker logs python-cassandra. As it is, the log will be produced in this console, allowing us to watch both consoles at the same time.]

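In another console, change to the repository directory and set up a current directory environment variable as follows [most shells maintain PWD automatically, so this is usually a no-op]: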

$ export PWD=`pwd`

Run cqlsh as follows:

$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql

It should look more or less as follows:

$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql

CREATE TABLE k8s_test.users (
    username text PRIMARY KEY,
    password text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';


 username | password
----------+----------
    Jesse |   secret
    Frank | password

(2 rows)
$

[Note that Cassandra has defaulted a lot of the table values for us. Here the default Compaction Strategy is Size-Tiered, which seems appropriate for the current use case - where the records will be written once.]

In the event it looks as follows, Cassandra probably has not fully started (and it may be necessary to retry):

$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
$
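Rather than retrying by hand, a script can wait until the CQL port accepts connections. The sketch below is written for Python 3 and uses only the standard library; wait_for_port is a hypothetical helper, and it assumes port 9042 is reachable from where the script runs (as in the port-mapping setup used in the "Run Cassandra with Python" section, not the linked-container setup shown here). It only confirms that a TCP connection succeeds, not that Cassandra is ready to serve every query.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=9042, attempts=30, delay=2.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for _ in range(attempts):
        try:
            # "Connection refused" surfaces here as an OSError subclass.
            with socket.create_connection((host, port), timeout=delay):
                return True
        except OSError:
            time.sleep(delay)  # Cassandra is probably still starting
    return False
```

A caller might run wait_for_port() before connecting and give up with a clear message if it returns False.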

Now we can kill Cassandra in the original console with Ctrl-C. Once it has stopped, remove python-cassandra:

$ docker rm python-cassandra

Clean up the data volumes as follows [note that this removes all unused Docker volumes, not only those created for Cassandra]:

$ docker volume prune

Run Cassandra with Python

[We will use Docker port-mapping to expose Cassandra; port 9042 must be available on the local machine.]

Run Cassandra as follows:

$ docker run --name python-cassandra -p 9042:9042 cassandra:3.11.3

In another console, set up a current directory environment variable as follows:

$ export PWD=`pwd`

Run cqlsh to set up our keyspace and table as follows:

$ docker run -it --link python-cassandra:cassandra --rm -v $PWD/cql:/cql cassandra:3.11.3 cqlsh cassandra -f /cql/users.cql

[This will leave our table empty.]

Run python add_users.py to add some users. This should look like:

$ python add_users.py
2018-12-16 21:19:34,667 [INFO] cassandra.policies: Using datacenter 'datacenter1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
2018-12-16 21:19:34,721 [INFO] root: Created user: user_0
2018-12-16 21:19:34,723 [INFO] root: Created user: user_1
2018-12-16 21:19:34,725 [INFO] root: Created user: user_2
2018-12-16 21:19:34,727 [INFO] root: Created user: user_3
2018-12-16 21:19:34,728 [INFO] root: Created user: user_4
2018-12-16 21:19:34,730 [INFO] root: Created user: user_5
2018-12-16 21:19:34,731 [INFO] root: Created user: user_6
2018-12-16 21:19:34,732 [INFO] root: Created user: user_7
2018-12-16 21:19:34,733 [INFO] root: Created user: user_8
2018-12-16 21:19:34,734 [INFO] root: Created user: user_9
2018-12-16 21:19:34,734 [INFO] root: 10 users added
$
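For orientation, a script producing output like the above might look roughly as follows. This is a sketch and not the actual add_users.py from the repository: the function names and the INSERT statement are assumptions, based on the k8s_test.users table and the user_N / password_N values shown in this README. It uses a prepared statement via the driver's Session.prepare and Session.execute.

```python
import logging

# Assumed statement, based on the k8s_test.users table shown earlier.
INSERT_USER = "INSERT INTO k8s_test.users (username, password) VALUES (?, ?)"


def add_users(session, count=10):
    """Insert count users through a prepared statement; return the number added."""
    prepared = session.prepare(INSERT_USER)  # parsed once, executed many times
    for i in range(count):
        username = "user_%d" % i
        session.execute(prepared, (username, "password_%d" % i))
        logging.info("Created user: %s", username)
    logging.info("%d users added", count)
    return count


def main():
    # Imported here so add_users() can be exercised without the driver installed.
    from cassandra.cluster import Cluster

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    )
    cluster = Cluster(["127.0.0.1"], port=9042)  # the port mapped by docker run
    session = cluster.connect()
    try:
        add_users(session)
    finally:
        cluster.shutdown()
```

With Cassandra running as described above, calling main() would connect to 127.0.0.1:9042 and insert the ten users.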

Run python list_users.py to list the users. This should look like:

$ python list_users.py
2018-12-16 21:26:35,618 [INFO] cassandra.policies: Using datacenter 'datacenter1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Row(username=u'user_7', password=u'password_7')
Row(username=u'user_6', password=u'password_6')
Row(username=u'user_1', password=u'password_1')
Row(username=u'user_2', password=u'password_2')
Row(username=u'user_4', password=u'password_4')
Row(username=u'user_9', password=u'password_9')
Row(username=u'user_3', password=u'password_3')
Row(username=u'user_8', password=u'password_8')
Row(username=u'user_5', password=u'password_5')
Row(username=u'user_0', password=u'password_0')
2018-12-16 21:26:35,654 [INFO] root: 10 users listed
$

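[Note that the users are listed in no obvious order. With the default partitioner, rows from different partitions are returned in token order, that is, ordered by a hash of the partition key (here username) rather than by its value. While the CQL SELECT statement does have an ORDER BY clause, it applies only to clustering columns within a partition and merely affects how the stored data is read; it does not sort results at run time.]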
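If a sorted listing is wanted for a table like this one, one option is to sort on the client. The sketch below is not the repository's list_users.py; the list_users function and the SELECT statement are assumptions based on the table and output shown above. It relies on the driver returning rows as named tuples (as the Row(...) output suggests), so row.username is available.

```python
def list_users(session):
    """Return all users sorted by username on the client side (hypothetical helper)."""
    rows = session.execute("SELECT username, password FROM k8s_test.users")
    # The server returns rows in token order, so sort here instead.
    return sorted(rows, key=lambda row: row.username)
```

Client-side sorting is only reasonable for small result sets like this one; for large tables the data model itself (partition and clustering keys) would need to reflect the required order.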

And kill Cassandra in the original console with Ctrl-C. Once it has stopped, remove python-cassandra:

$ docker rm python-cassandra

Finally, clean up the data volumes as follows:

$ docker volume prune

Reference

For the details of using Cassandra with Docker:

http://hub.docker.com/_/cassandra/

Cassandra connection, Session and Cluster parameters (including defaults):

http://datastax.github.io/python-driver/api/cassandra/cluster.html

Materialized View Performance Penalty:

http://www.datastax.com/dev/blog/materialized-view-performance-in-cassandra-3-x

[Materialized views seem to be a way of imposing a finer index on stored data. There is a performance penalty.]

Versions

  • Cassandra 3.11.3
  • cassandra-driver 3.16.0
  • lz4 2.1.2
  • pip 18.1
  • python 2.7.12
  • scales 1.0.9

To Do

  • Write Python code
  • Replace print statements with logging
  • Investigate Cassandra Metrics with Python
  • More testing

Credits

There are many fine resources for learning Cassandra. The place to start is:

http://datastax.github.io/python-driver/getting_started.html

[Well worth careful study for the sections on type conversion, consistency level and prepared statements.]

Also:

http://datastax.github.io/python-driver/installation.html

[For the intricacies of installing the Python driver.]
