Description
Disclaimer: I am not an expert in these two systems. I am just sharing my short analysis and thoughts. If my analysis contains errors, I would be glad if someone points them out!
The README lists Cassandra and ScyllaDB (API-compatible with Cassandra) in the Systems Included
section, but there are no benchmarks for these systems available yet.
TL;DR
How can Cassandra and ScyllaDB, despite their limitations, be benchmarked on ClickBench?
Long version
I am wondering how Cassandra and ScyllaDB could be benchmarked on ClickBench in a way that the results are comparable to other systems?
While there are workarounds for some limitations, e.g., replacing the NOT EQUAL
operator with NOT IN
in predicates, and replacing arithmetic functions and the REGEX_REPLACE
function by user-defined functions (Lua, WASM), I do think that many queries cannot be ran like intended.
Issue 1: There seems to be no support for
COUNT(DISTINCT <column>)
OFFSET <offset>
This affects queries Q5, Q6, Q9, Q10, Q11, Q12, Q14, Q23 and Q39, Q40, Q41, Q42, Q43, respectively.
Furthermore, there are also limitations on ORDER BY
.
Issue 2: Queries containing an
ORDER BY
need to be restricted by anEQ
orIN
on the partition key.
Some queries use derived fields in the ORDER BY
. Hence, these queries may not be runnable on these systems at all, or at least not in the intended way with on-the-fly calculation of the derived fields. This affects queries Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15, Q16, Q17, Q19, Q22, Q28, Q29, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43.
In general, Cassandra and ScyllaDB seem to heavily rely on the primary/partition/clustering key for data querying. Data modeling should be (has to be? 🤔) done according to the expected query patterns. In case of heterogeneous query patterns, tables need to be duplicated and stored with different primary/partition/clustering keys.
In addition to the restrictions on ORDER BY
, there are also restrictions on GROUP BY
.
Issue 3:
GROUP BY
arguments must form a prefix of the primary key.
For example, Q17 and Q18 (GROUP BY UserID, SearchPhrase
) could be ran on the same table by setting the primary key accordingly (at least in theory, because the combination of UserID, SearchPhrase
might not be unique). However, then it is impossible to run Q9 and Q10 (GROUP BY RegionID
), Q11 (GROUP BY MobilePhoneModel
), Q12 (GROUP BY MobilePhone, MobilePhoneModel
), etc. on the same table.
Any thoughts on this? Am I missing something?