Skip to content

[Discussion] How should Cassandra and ScyllaDB (API-compatible with Cassandra) be benchmarked? #384

Closed
@michaelkoepf

Description

@michaelkoepf

Disclaimer: I am not an expert in these two systems. I am just sharing my short analysis and thoughts. If my analysis contains errors, I would be glad if someone points them out!

The README lists Cassandra and ScyllaDB (API-compatible with Cassandra) in the Systems Included section, but there are no benchmarks for these systems available yet.

TL;DR

How can Cassandra and ScyllaDB, despite their limitations, be benchmarked on ClickBench?

Long version

I am wondering how Cassandra and ScyllaDB could be benchmarked on ClickBench in a way that the results are comparable to other systems?

While there are workarounds for some limitations, e.g., replacing the NOT EQUAL operator with NOT IN in predicates, and replacing arithmetic functions and the REGEX_REPLACE function by user-defined functions (Lua, WASM), I do think that many queries cannot be ran like intended.

Issue 1: There seems to be no support for

This affects queries Q5, Q6, Q9, Q10, Q11, Q12, Q14, Q23 and Q39, Q40, Q41, Q42, Q43, respectively.

Furthermore, there are also limitations on ORDER BY.

Issue 2: Queries containing an ORDER BY need to be restricted by an EQ or IN on the partition key.

Some queries use derived fields in the ORDER BY. Hence, these queries may not be runnable on these systems at all, or at least not in the intended way with on-the-fly calculation of the derived fields. This affects queries Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15, Q16, Q17, Q19, Q22, Q28, Q29, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43.

In general, Cassandra and ScyllaDB seem to heavily rely on the primary/partition/clustering key for data querying. Data modeling should be (has to be? 🤔) done according to the expected query patterns. In case of heterogeneous query patterns, tables need to be duplicated and stored with different primary/partition/clustering keys.

In addition to the restrictions on ORDER BY, there are also restrictions on GROUP BY.

Issue 3: GROUP BY arguments must form a prefix of the primary key.

For example, Q17 and Q18 (GROUP BY UserID, SearchPhrase) could be ran on the same table by setting the primary key accordingly (at least in theory, because the combination of UserID, SearchPhrase might not be unique). However, then it is impossible to run Q9 and Q10 (GROUP BY RegionID), Q11 (GROUP BY MobilePhoneModel), Q12 (GROUP BY MobilePhone, MobilePhoneModel), etc. on the same table.


Any thoughts on this? Am I missing something?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions